REPORT  DOCUMENTATION  PAGE 


Form  Approved 
OMB No. 0704-0188


Public reporting burden for this collection of information is estimated to average 1 hour per response, including the time for reviewing instructions, searching existing data sources,
gathering and maintaining the data needed, and completing and reviewing the collection of information. Send comments regarding this burden estimate or any other aspect of this
collection of information, including suggestions for reducing this burden, to Washington Headquarters Services, Directorate for Information Operations and Reports, 1215 Jefferson
Davis Highway, Suite 1204, Arlington, VA 22202-4302, and to the Office of Management and Budget, Paperwork Reduction Project (0704-0188), Washington, DC 20503.


1. AGENCY USE ONLY (Leave blank)


2.  REPORT  DATE 

June  1995 


4. TITLE AND SUBTITLE

Conference  on  Computing  Science  &  Statistics 
Symposium  on  the  Interface 


3. REPORT TYPE AND DATES COVERED

Final  15  May  94  -  14  May  95 


5.  FUNDING  NUMBERS 


6.  AUTHOR(S) 

Edward  Wegman 


(principal  investigator) 


DAAH04-94-G-0222


8. PERFORMING ORGANIZATION REPORT NUMBER

10. SPONSORING/MONITORING AGENCY REPORT NUMBER

ARO 32588.1-MA-CF


11.  SUPPLEMENTARY  NOTES 

The  views,  opinions  and/or  findings  contained  in  this  report  are  those  of  the 
author(s) and should not be construed as an official Department of the Army
position,  policy,  or  decision,  unless  so  designated  by  other  documentation. 


12a. DISTRIBUTION/AVAILABILITY STATEMENT

12b. DISTRIBUTION CODE


9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES)

U.S. Army Research Office
P.O. Box 12211
Research Triangle Park, NC 27709-2211

Approved  for  public  release;  distribution  unlimited. 


13.  ABSTRACT  (Maximum  200  words) 


The 26th Symposium on the Interface of Computing Science and Statistics was held on June 15-18, 1994 at the
Sheraton Imperial Hotel in Research Triangle Park, NC. The conference theme was "Computationally Intensive
Statistical Methods." The theme is especially appropriate as computational power has increased dramatically in
the last few years and the use of resampling techniques has boomed.

The Interface was scheduled between two other statistics conferences in the same area: the Spring Research
Conference on Statistics in Industry, hosted by the National Institute of Statistical Sciences, and the Third
World Congress-Bernoulli Society-IMS meetings, at the University of North Carolina in Chapel Hill.

The  conference  attracted  365  attendees.  There  were  23  invited  sessions,  21  contributed  paper  sessions,  9  poster 
presentations,  4  short  courses,  2  practical  tutorials,  several  statistical  tutorial  sessions,  one  keynote  speech,  one 
banquet  presentation,  and  2  tours. 


14.  SUBJECT  TERMS 

15.  NUMBER  OF  PAGES 

16.  PRICE  CODE 

17. SECURITY CLASSIFICATION

18.  SECURITY  CLASSIFICATION 

19.  SECURITY  CLASSIFICATION 

20.  LIMITATION  OF  ABSTRACT 

OF  REPORT 

OF  THIS  PAGE 

OF  ABSTRACT 

UNCLASSIFIED 

UNCLASSIFIED 

UNCLASSIFIED 

UL 

NSN  7540-01-280-5500 


Standard Form 298 (Rev. 2-89)



Computing  Science 

and  Statistics  Volume  26 


Computationally  Intensive 
Statistical  Methods 

Editors 

John Sall
Ann  Lehman 

Proceedings  of  the 
26th  Symposium  on  the  Interface 


19950703  216 

INTERFACE  FOUNDATION  OF  NORTH  AMERICA 


DTIC QUALITY INSPECTED 3


PUBLISHER'S FOREWORD


Notices 

The papers in this volume are printed exactly as they were submitted as a record of the conference and are
reproduced as received from the authors. These presentations are presumed to be essentially as given at the
26th Symposium on the Interface. The papers have not been reviewed and no claims are made by the editors
or publishers as to the originality or accuracy of their contents.

This  volume  is  not  copyrighted  by  the  Interface  Foundation  of  North  America,  Inc.  although  individual  items 
may  be  copyrighted  by  their  authors.  If  no  copyright  notice  is  indicated,  it  is  presumed  that  the  author(s)  have 
not  copyrighted  their  material  and  that  you  may  freely  copy  the  contents  from  this  volume  provided  that  you  cite 
the  source.  Publication  in  this  volume  does  not  preclude  authors  from  submitting  the  papers  to  other 
publications. 

An  example  of  the  recommended  citation  of  articles  from  this  publication  is 

Heavlin, W.D., and Finnegan, G.P. (1994), "Dual space algorithms for designing space-filling
experiments," Computing Science and Statistics, 26, 41-47.

If more details are required, the editors and the publisher (Interface Foundation of North America, Inc.) may be
added.

Purchase of Previous Volumes

You may purchase this Volume and Volumes 20 through 25 (1988 through 1993) from

Interface Foundation of North America, Inc.
P.O. Box 7460
Fairfax Station, VA 22039-7460.

Volume 22 (1990) is also available from

Springer-Verlag, New York, Inc.
175 Fifth Avenue
New York, NY 10010-3402

Volumes 18, 19, 20 and 21 (1986-1989) are available from

American Statistical Association
1429 Duke Street
Alexandria, VA 22314-3402

Interface  '95 

Please  plan  to  attend  the  next  Interface  Symposium  scheduled  for  June  21-24  in  Pittsburgh.  It  will  be  hosted  by 
Carnegie  Mellon  University  and  the  Pennsylvania  State  University  with  Michael  Meyer  and  James  Rosenberger 
as  joint  program  chairs.  For  details: 

email:  interface95@stat.cmu.edu 

phone:  (412)  268-3108  fax:  (412)  268-7828 

mail: Interface '95

Department  of  Statistics 
Carnegie  Mellon  University 
5000  Forbes  Avenue 
Pittsburgh,  PA  15213  USA 

Interface, Interface '94, Interface '95, Computing Science and Statistics, and the triangle logo are trademarks
of  the  Interface  Foundation  of  North  America,  Inc. 

ISBN  1-886658-00-5 

PRINTED  IN  THE  UNITED  STATES  OF  AMERICA  (1994) 


PREFACE 

1994  Interface  Proceedings 


The  26th  Symposium  on  the  Interface  of  Computing  Science  and  Statistics  was  held  on  June  15-18, 1994  at  the 
Sheraton  Imperial  Hotel  in  Research  Triangle  Park,  NC.  The  conference  theme  was  "Computationally  Intensive 
Statistical  Methods."  The  theme  is  especially  appropriate  as  computational  power  has  increased  dramatically  in 
the  last  few  years  and  the  use  of  resampling  techniques  has  boomed. 

The  Interface  was  scheduled  between  two  other  statistics  conferences  in  the  same  area:  the  Spring  Research 
Conference  on  Statistics  in  Industry,  hosted  by  the  National  Institute  of  Statistical  Sciences,  and  the  Third 
World Congress-Bernoulli Society-IMS meetings, at the University of North Carolina in Chapel Hill.


The  conference  attracted  365  attendees.  There  were  23  invited  sessions,  21  contributed  paper  sessions,  9  poster 
presentations,  4  short  courses,  2  practical  tutorials,  several  statistical  tutorial  sessions,  one  keynote  speech,  one 
banquet  presentation,  and  2  tours. 


Conference  Events 

The conference started Wednesday afternoon with 4 short courses, followed by a mixer that evening. The short
courses were organized by Tom Devlin, who is continuing education coordinator for the Statistical Computing
Section of ASA. The courses were: Modern Nonparametric Regression and Classification, by Trevor Hastie
and Rob Tibshirani; Resampling-Based Multiple Testing, by P. H. Westfall and S. S. Young; Algorithms for
Estimation and Visualization of Multivariate Density Functions with Applications to Clustering, by David W.
Scott; and Data Analysis using Interactive Dynamic Graphics: An Introduction to XGobi, by Di Cook, Martin
Koschat, and Deborah Swayne.


On Thursday morning, the keynote address was presented by G. W. "Pete" Stewart, Professor in the Computer
Science Department and Research Professor in the Institute for Advanced Computer Studies at the University of
Maryland. Pete talked about "Gauss, Statistics, and Gaussian Elimination," in which Gauss is seen as a
statistician inventing numerical methods in the service of fitting data. Pete Stewart is a well-known authority in
the field of numerical linear algebra. Originally a student of Alston Householder, he is the author of over ninety
papers on various aspects of numerical analysis and matrix computation. His books include Introduction to
Matrix Computation and, with J. G. Sun, Matrix Perturbation Theory. He is a co-author of the LINPACK
package for linear algebra. Pete was introduced by Bob Funderlic, North Carolina State University.

On  Thursday  evening,  there  were  tours  to  the  UNC  Graphics  and  Image  Lab  in  Chapel  Hill,  and  to  SAS 
Institute in Cary. The feature at the UNC lab was virtual reality and the Pixel-Planes 5 parallel graphics
computer.  The  feature  at  SAS  Institute  was  the  new  400,000  square  foot  research  building. 

On  Friday,  a  banquet  dinner  was  held  with  music  by  the  Bluegrass  Retreat.  Interface  business  manager  Ruth 
Lee  played  bass  guitar.  Dinner  was  followed  by  a  presentation  on  computer  animation  by  Wayne  Lytle,  an 
award-winning  computer  graphics  animator  from  the  Cornell  University  Theory  Center.  Wayne’s  presentation 
featured  scientific  animations  describing  the  recent  breakthrough  discovery  of  planets  in  a  distant  star  system. 
Particularly  enjoyable  were  a  humorous  animation  on  glitziness  overload  in  scientific  presentations,  and  a 
segment  on  music  animation. 


The  Conference  Organization 

Interface  Conferences  are  sponsored  by  the  Interface  Foundation  of  North  America.  IFNA  is  a  nonprofit 
educational corporation founded in 1987 to sponsor the symposium and publish the proceedings. IFNA also
co-publishes the Journal of Computational and Graphical Statistics.




The  conference  is  undertaken  with  the  support  and  cooperation  of  the  following  societies:  the  American 
Statistical  Association,  the  Institute  for  Mathematical  Statistics,  the  International  Association  for  Statistical 
Computing,  the  Society  for  Industrial  and  Applied  Mathematics,  and  the  Operations  Research  Society  of 
America. 


SAS Institute hosted this year's conference, with John Sall serving as program chair. SAS Institute is a
software  company  specializing  in  statistical  computing,  and  is  located  in  nearby  Cary,  NC.  SAS  Institute 
provided  personnel  and  services  free  of  charge  for  the  meeting. 

The program committee and session organizers were Stephen G. Eick, J. S. Marron, Russ Wolfinger, Sally
Morton, Mike West, S. Stanley Young, Raoul LePage, Ron Gallant, Alex Georgiev, Bill DuMouchel, Cyrus R.
Mehta, Chris Portier, Ed Wegman, David Rocke, Iain Johnstone, Peter Munson, Tim Hesterberg, Richard
Smith, Francoise Seillier-Moiseiwitsch, Forrest Young, and John Elder. Featured speakers included Adrian
Smith, Andrew Barron, and Mary Ellen Bock. Additional tutorials were given by Tim Arnold and Phil Spector.

Session chairs included Jianqing Fan, Lisa LaVange, James L. Rosenberger, Mark Little, Nick Fisher, Ming
Tan, Wolfgang Hartmann, John Elder, Dave Dickey, Karen Kafadar, Phil Spector, Warren Sarle, John Nash,
George Guirguis, Al Best, Forrest Young, Alan Genz, Phil Spector, Mary Ellen Bock, Cyrus R. Mehta, Chris
Portier, Leonard B. Hearne, Bill Kemple, Feng Gao, Ying So, Deborah Swayne, Dennis Boos, and Gordon
Johnston.

Outside of the program, the people who put the conference together were: Ruth Lee, conference business
manager; Susan Byrd, hotel coordinator; Armistead Sapp, equipment manager; Jane Pierce, abstracts editor;
Stefanie Barber Mueller, Kristin Rinne, Marybeth Mahoney, Curt Yeo and the SAS Institute Copy Center, for
graphic arts; Lynn Fountain, Chris Gilmore, and Bob Rodriguez for the SAS tour; and Linda Houseman for the UNC
tour. Interpath provided Internet connections. The IFNA head office with Ed Wegman and Pat Joyce did the
printing, mailing, grant administration, and accounts payable.


John Sall and Ann Lehman
Editors 


1. Introduction

Everyone knows that Gauss invented Gaussian elimination,
and, excepting a quibble, everyone is right.¹ What
is less well known is that Gauss introduced the procedure
as a mathematical tool to get at the precision of
least squares estimates. In fact the computational component
in the original description is so little visible that
it takes some doing to see an algorithm in it.

Gaussian elimination, therefore, was not conceived as
a general numerical algorithm with applications in statistics
and least squares. Rather it was a procedure that
sprang from the interface of statistics and computation.
Since the full story is known only to the few who have
consulted the original sources, I hope my readers will be
interested to see how Gauss did things. But there is more
than the satisfaction of idle curiosity here. Gauss and
Laplace were the premier statisticians of their day, and
Gauss alone was the premier numerical analyst. Today
we still have something to learn from observing Gauss's
practices.

2.  Chronicles 

The principle of least squares arose from the problem
of combining sets of overdetermined equations to form
a square system that could be solved for the unknowns.
The problem went under the name of the combination
of observations, and has been well surveyed by Stigler
[23] in his History of Statistics. By way of background, I
will relate in chronological order the major events in the
story of least squares, from Gauss's first discovery to his
final treatment in the 1820's.

In his correspondence, Gauss asserted that he had discovered
the principle of least squares in 1794 (or 1795,
the dates vary). Gauss seems to have had little regard
for the principle itself, and even said he thought others
must have used it before him. In June of 1798 Gauss
[11, V. 10] made the following entry in the little diary
of discoveries he kept from 1796 to 1814: "Probability
calculus defended against Laplace."² Laplace, following
Boscovich [1, 16], had suggested that observations be
combined by minimizing the sum of the absolute values
of the residuals subject to the condition that the residuals
sum to zero. Gauss felt that this way of combining
observations violated the dictates of probability theory,
and his alternative was the first probabilistic justification
of least squares.

¹The quibble is that in 1759, in the very first paper to appear
in his collected works [14], Lagrange gave the basic computational
formulas for Gaussian elimination. His purpose, however, was to
determine if a critical point was a minimum, not to solve linear
equations. There is no indication that the paper had any influence
on Gauss, or anyone else.

The following entry in the diary, also dated June 1798,
contains the statement: "The problem of elimination resolved
in such a way that nothing more can be desired."³
I take this entry to be the first reference to Gaussian
elimination. But a decade was to pass before Gauss published
either the probabilistic justification or the elimination
procedure.

Although we tend to regard Gauss chiefly as a mathematician,
it was as an astronomer that he first made
his mark. On New Year's Day of 1801, the astronomer
Piazzi discovered the asteroid Ceres. The new planet became
unobservable after only nine degrees of an arc had
been recorded, and astronomers were faced with the problem
of determining where to look for it next. Gauss undertook
the calculation, using new techniques in physical
astronomy and presumably his principle of least squares.
At the end of 1801, he predicted where in the heavens the
asteroid would be found, and his reputation was made.

Gauss, who was generally slow to publish, began work
in 1805 on his Theoria Motus Corporum Coelestium, in
which he described his techniques for computing orbits
and gave his first probabilistic justification of the principle
of least squares. He finished in 1806, but his publisher,
worried by German losses to Napoleon, insisted he
translate the treatise into Latin. In consequence it did
not appear until 1809 [2]. In the meantime, Legendre
[20] published and named the method of least squares
(la méthode des moindres quarrés) in an appendix to a
memoir appearing in 1805. When the Theoria Motus finally
appeared, Legendre found that Gauss had claimed
the principle for his own, and he took exception. The
result was a priority dispute, which need not concern us
here.⁴

²In the original Latin: Calculus probabilitatis contra La Place
defensus.

³Problema eliminationis ita solutum, ut nihil amplius desiderari
possit.

⁴Plackett [21] gives a balanced survey with translations from
Gauss's correspondence.
In the Theoria Motus, Gauss had assumed the errors
in the observations were normally distributed. In 1811,
Laplace [17] used his central limit theorem to give an essentially
different justification of least squares. This is not the
place to enter into details, but briefly Laplace showed
that the solutions of a combination of equations were
asymptotically normal and from this concluded that the
least squares combination would minimize the mean absolute
error in the solutions. Laplace's approach does
not readily extend beyond two unknowns.

The final chapter occurred in the 1820's when Gauss
[5, 6, 8] published two memoirs on least squares. The
first, in two parts, contains yet another justification of
least squares: Gauss's famous minimum variance theorem.
These papers also contain some nice algorithmics,
which will concern us later.

3.  The  Precision  of  Estimates 

The first appearance of Gaussian elimination in print
occurs in Section 182 of the Theoria Motus. In order to
understand what Gauss is about, we will have to sketch
some background.

Gauss (after a linearization) considers the model⁵

$$y = Xb + e,$$

where $X$ is $n \times p$. The errors $e_i$ are assumed to be
independent random variables with common distribution
$\varphi(e)$. Gauss introduces the function

$$\varphi(y_1 - x_1^T b)\,\varphi(y_2 - x_2^T b) \cdots \varphi(y_n - x_n^T b), \qquad (3.1)$$

where the $x_i^T$ are the rows of $X$, and uses a Bayesian
argument with a uniform prior to argue that the value
of $b$ that maximizes (3.1) is the most probable value of
the unknowns.

Gauss now supposes the distribution of the $e_i$ is normal;
that is, $\varphi(e) \propto e^{-h^2 e^2}$. He identifies the parameter
$h$ with the precision⁶ of $y$. The function (3.1) now
becomes proportional to

$$e^{-h^2 \Omega}, \qquad (3.2)$$

where

$$\Omega = (y - Xb)^T (y - Xb)$$

is the residual sum of squares. Thus, Gauss's most probable
value is obtained by minimizing the residual sum
of squares, which justifies the principle of least squares.
The normal equations can be derived as usual by differentiation.
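As a small numerical illustration of this argument (a sketch, not part of Gauss's text), the following Python fragment forms the normal equations for a hypothetical four-point straight-line fit and checks that their solution minimizes the residual sum of squares. The data, the helper names, and the use of exact fractions are all assumptions made for the example.

```python
from fractions import Fraction as F

# Hypothetical data (not from the paper): fit y = b1 + b2*t at t = 0, 1, 2, 3.
X = [[F(1), F(0)], [F(1), F(1)], [F(1), F(2)], [F(1), F(3)]]
y = [F(1), F(3), F(4), F(8)]


def residual_ss(b):
    """Omega(b) = (y - Xb)^T (y - Xb), the residual sum of squares."""
    return sum((yi - sum(x * bj for x, bj in zip(row, b))) ** 2
               for row, yi in zip(X, y))


def normal_equations(X, y):
    """Return A = X^T X and c = X^T y, so the normal equations read A b = c."""
    p = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) for j in range(p)]
         for i in range(p)]
    c = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return A, c


A, c = normal_equations(X, y)

# Solve the 2 x 2 system A b = c by Cramer's rule (enough for this toy case).
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
b_hat = [(c[0] * A[1][1] - A[0][1] * c[1]) / det,
         (A[0][0] * c[1] - A[1][0] * c[0]) / det]

print("b_hat =", b_hat, " Omega(b_hat) =", residual_ss(b_hat))
```

Moving either coefficient away from `b_hat` increases `residual_ss`, which is the sense in which the solution of the normal equations is Gauss's most probable value under the normal error law.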

Gauss next turns to the problem of estimating the
precision of the least squares estimates. His technique
is to integrate all but the last unknown out of (3.2),
after which the precision can be read off. However, to
perform the required integrations $\Omega$ must be expressed
in a special form, and the tool for arriving at that form
is Gaussian elimination.

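The procedure as given by Gauss is the following. Let

$$u_1 = \frac{1}{2}\frac{\partial \Omega}{\partial b_1} = r_{11} b_1 + r_{12} b_2 + \cdots + r_{1p} b_p - s_1,$$

and let

$$\Omega_1 = \Omega - \frac{u_1^2}{r_{11}}.$$

Then clearly the derivative of $\Omega_1$ with respect to $b_1$ is
zero, so that $\Omega_1$ is independent of $b_1$.

One more step will illustrate the general procedure.
Set

$$u_2 = \frac{1}{2}\frac{\partial \Omega_1}{\partial b_2} = r_{22} b_2 + r_{23} b_3 + \cdots + r_{2p} b_p - s_2.$$

Then

$$\Omega_2 = \Omega_1 - \frac{u_2^2}{r_{22}}$$

is independent of $b_1$ and $b_2$. Continuing in this manner
we arrive at the decomposition

$$\Omega = \frac{u_1^2}{r_{11}} + \frac{u_2^2}{r_{22}} + \cdots + \frac{u_p^2}{r_{pp}} + \rho,$$

in which $u_i$ is independent of $b_1, \ldots, b_{i-1}$ and $\rho$ is constant.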
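The reduction just described can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than Gauss's own scheme: the quadratic form is assumed to be stored as $\Omega = b^T A b - 2 b^T c + \omega$ (a matrix representation the paper itself only introduces later), the function names are invented for the example, and the numbers are the cross-products of a hypothetical four-point line fit.

```python
from fractions import Fraction as F


def gauss_decompose(A, c, omega):
    """Reduce Omega = b^T A b - 2 b^T c + omega to sum_i u_i^2 / r_ii + rho.

    Returns (R, s, rho), where row i of R and entry i of s hold the
    coefficients of u_i = r_ii b_i + ... + r_ip b_p - s_i.
    """
    p = len(A)
    # Store the form as the symmetric array [[A, c], [c^T, omega]].
    M = [[F(a) for a in A[i]] + [F(c[i])] for i in range(p)]
    M.append([F(ci) for ci in c] + [F(omega)])
    R, s = [], []
    for k in range(p):
        row = M[k][:]                 # coefficients of u_k, then s_k
        R.append(row[:p])
        s.append(row[p])
        # Omega_k = Omega_{k-1} - u_k^2 / r_kk, applied entry by entry.
        for i in range(p + 1):
            for j in range(p + 1):
                M[i][j] -= row[i] * row[j] / row[k]
    return R, s, M[p][p]


def omega_direct(A, c, omega, b):
    """Evaluate b^T A b - 2 b^T c + omega."""
    n = len(b)
    quad = sum(A[i][j] * b[i] * b[j] for i in range(n) for j in range(n))
    return quad - 2 * sum(ci * bi for ci, bi in zip(c, b)) + omega


def omega_decomposed(R, s, rho, b):
    """Evaluate sum_i u_i^2 / r_ii + rho with u = R b - s."""
    u = [sum(rij * bj for rij, bj in zip(row, b)) - si
         for row, si in zip(R, s)]
    return sum(ui * ui / R[i][i] for i, ui in enumerate(u)) + rho


# Hypothetical cross-products: A = X^T X, c = X^T y, omega = y^T y.
A, c, omega = [[4, 6], [6, 14]], [16, 35], 90
R, s, rho = gauss_decompose(A, c, omega)
print(R, s, rho)
```

The two evaluation helpers agree for any choice of `b`, which is a direct check that the sum-of-squares form is the same function as the original quadratic.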

Gauss now considers the expression

$$e^{-h^2 \Omega} \propto \exp\left(-\frac{h^2 u_1^2}{r_{11}}\right) \cdot \exp\left(-\frac{h^2 u_2^2}{r_{22}}\right) \cdots \exp\left(-\frac{h^2 u_p^2}{r_{pp}}\right)$$

and integrates with respect to $b_1$ over the real line. Since
the last $p - 1$ factors in this expression are free of $b_1$,
they remain unchanged by the integration. The first
factor integrates to a constant. Thus Gauss is left with
a distribution proportional to

$$\exp\left(-\frac{h^2 u_2^2}{r_{22}}\right) \cdots \exp\left(-\frac{h^2 u_p^2}{r_{pp}}\right),$$

which is free of $b_1$. Continuing this process of integrating
out the parameters $b_i$, Gauss finds that the distribution
of $b_p$ is proportional to

$$\exp\left(-\frac{h^2 u_p^2}{r_{pp}}\right),$$

where

$$u_p = r_{pp} b_p - s_p.$$

Gauss concludes that the most probable value of $b_p$, obtained
by setting $u_p = 0$, is

$$b_p = \frac{s_p}{r_{pp}},$$

and its precision is

$$h\sqrt{r_{pp}}.$$

⁵We will make free use of matrices in what follows, but only as a
means of abbreviating Gauss's scalar equations.

⁶We must not use terms like variance or standard deviation
here. The number $h$ is simply a parameter in a specific distribution.
Only in the Theoria Combinationis will Gauss introduce the
second moment of a general distribution as a measure of variation.

Gauss now goes on to show that if you write the normal
equations in the form

$$Ab = c \qquad (3.3)$$

and express $b$ as a function of $c$ in the form

$$b = Vc, \qquad (3.4)$$

then the $(p,p)$-element of $V$ is $r_{pp}^{-1}$. Since the resulting
expression for the precision clearly does not depend on
the position of the unknown, Gauss concludes that the
precision of any of the estimates $b_i$ is $h/\sqrt{v_{ii}}$.

It is ironic that the Theoria Motus should have become
the principal reference for Gaussian elimination as a computational
tool. As we have seen, Gauss used elimination
to give a derivation of one of the most important results
of linear regression theory. He was certainly aware of
the computational consequences of his elimination procedure,
and promises to describe them in a later work.
But computational considerations are absent from the
Theoria Motus itself. Gauss merely points out that the
normal equations can be solved by ordinary elimination
(eliminatio vulgaris), presumably a variant of what we
now call Gauss-Jordan elimination. An extension, which
Gauss will later call general elimination (eliminatio indefinita),
can be used to pass from the normal equations
(3.3) to the inverse system (3.4).

4.  The  Scalar  Connection 

In 1810, in Disquisitio de Elementis Ellipticis Palladis
[3], Gauss gave the numerical details of his algorithm
and illustrated it with an example. The formulas can
be derived by observing that a homogeneous quadratic
form is determined by its matrix of second derivatives.
Specifically, if we set

$$a_{ij} = \frac{1}{2}\frac{\partial^2 \Omega}{\partial b_i\,\partial b_j},$$

then it follows from the formula

$$\Omega_1 = \Omega - \frac{u_1^2}{a_{11}}$$

that

$$a_{ij}^{(1)} = \frac{1}{2}\frac{\partial^2 \Omega_1}{\partial b_i\,\partial b_j} = a_{ij} - \frac{a_{i1} a_{1j}}{a_{11}}.$$

In the expression on the right, we recognize the formulas
for performing one step of Gaussian elimination, as we
understand it today, on a matrix whose elements are $a_{ij}$.
This is essentially the algorithm Gauss describes in the
Disquisitio.
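The update formula can be tried directly. The following Python sketch applies a single elimination step to a small symmetric matrix; the function name and the sample matrix are hypothetical, chosen only to make the arithmetic easy to follow.

```python
from fractions import Fraction as F


def eliminate_first_variable(a):
    """One elimination step: a1[i][j] = a[i][j] - a[i][0] * a[0][j] / a[0][0]."""
    n = len(a)
    return [[F(a[i][j]) - F(a[i][0]) * a[0][j] / a[0][0] for j in range(n)]
            for i in range(n)]


# Hypothetical symmetric matrix of second derivatives.
a = [[4, 2, 2],
     [2, 5, 3],
     [2, 3, 6]]
a1 = eliminate_first_variable(a)
for row in a1:
    print(row)
```

The first row and column of the result vanish, reflecting the fact that the reduced form no longer involves the first unknown, and the remaining block is again symmetric.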

To complete the solution of the normal equations by
Gaussian elimination, note that since

$$\Omega = \frac{u_1^2}{r_{11}} + \frac{u_2^2}{r_{22}} + \cdots + \frac{u_p^2}{r_{pp}} + \rho,$$

the function $\Omega$ assumes its minimum value $\rho$ when

$$u_1 = u_2 = \cdots = u_p = 0.$$

Since

$$0 = u_p = r_{pp} b_p - s_p$$

is a linear equation involving only $b_p$, it can be solved
immediately for $b_p$. Having determined $b_p$, one can solve
for $b_{p-1}$ from the equation

$$0 = u_{p-1} = r_{p-1,p-1} b_{p-1} + r_{p-1,p} b_p - s_{p-1}.$$

Continuing in this manner, we can determine estimates
for all the unknowns $b_i$. This of course is nothing more
than the back substitution phase of Gaussian elimination.
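Back substitution as described here takes only a few lines. In this Python sketch the triangular coefficients `R` and constants `s` are hypothetical values (they correspond to a small two-unknown line fit), and the function name is an assumption of the example.

```python
from fractions import Fraction as F


def back_substitute(R, s):
    """Solve u_i = 0 for i = p, p-1, ..., 1, where u = R b - s and R is upper triangular."""
    p = len(s)
    b = [F(0)] * p
    for i in range(p - 1, -1, -1):
        # Terms of u_i that involve unknowns already determined.
        known = sum(F(R[i][j]) * b[j] for j in range(i + 1, p))
        b[i] = (F(s[i]) - known) / R[i][i]
    return b


# Hypothetical reduced system: 4 b1 + 6 b2 = 16 and 5 b2 = 11.
b = back_substitute([[4, 6], [0, 5]], [16, 11])
print(b)
```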

5.  The  Matrix  Connection 

The above description of the algorithm is incomplete, in
the sense that it does not give formulas for the constant
parts $s_i$ of the functions $u_i$. To see where they come
from, it will be useful to express the algorithm in terms
of matrices.

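The function $\Omega$ can be written in the form

$$\Omega = \begin{pmatrix} b^T & -1 \end{pmatrix} \begin{pmatrix} A & c \\ c^T & \omega \end{pmatrix} \begin{pmatrix} b \\ -1 \end{pmatrix},$$

where $\omega = y^T y$. If we set

$$R = \begin{pmatrix} r_{11} & r_{12} & \cdots & r_{1p} \\ 0 & r_{22} & \cdots & r_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & r_{pp} \end{pmatrix} \quad \text{and} \quad s = \begin{pmatrix} s_1 \\ s_2 \\ \vdots \\ s_p \end{pmatrix},$$

then it is easy to verify that

$$\begin{pmatrix} A & c \\ c^T & \omega \end{pmatrix} = \begin{pmatrix} R^T & 0 \\ s^T & \rho \end{pmatrix} \begin{pmatrix} D^{-1} & 0 \\ 0 & \rho^{-1} \end{pmatrix} \begin{pmatrix} R & s \\ 0 & \rho \end{pmatrix},$$

where

$$D = \mathrm{diag}(r_{11}, r_{22}, \ldots, r_{pp}).$$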

Thus Gaussian elimination, as practiced by Gauss,
amounts to factoring the augmented cross-product matrix
into a lower triangular matrix, a diagonal matrix,
and the transpose of the lower triangular matrix. It
is common practice today to work with the augmented
cross-product matrix.

The vector $u$ whose components are the functions $u_i$
can be written in the form

$$u = Rb - s.$$

The process sketched above of setting the $u_i$ to zero and
back-solving amounts to solving the triangular system

$$Rb = s.$$
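The factorization can be checked numerically by multiplying it back out. The Python sketch below rebuilds the cross-products from a triangular factor using the block identities $A = R^T D^{-1} R$, $c = R^T D^{-1} s$, and (for the corner entry, written here as `omega`) $s^T D^{-1} s + \rho$. The function name and the sample values of `R`, `s`, and `rho` are assumptions made for illustration.

```python
from fractions import Fraction as F


def rebuild_cross_products(R, s, rho):
    """Multiply out the three-factor product and return (A, c, omega).

    With D = diag(r_11, ..., r_pp) the blocks are
    A = R^T D^-1 R, c = R^T D^-1 s, omega = s^T D^-1 s + rho.
    """
    p = len(R)
    A = [[sum(F(R[k][i]) * R[k][j] / R[k][k] for k in range(p))
          for j in range(p)] for i in range(p)]
    c = [sum(F(R[k][i]) * s[k] / R[k][k] for k in range(p))
         for i in range(p)]
    omega = sum(F(s[k]) * s[k] / R[k][k] for k in range(p)) + rho
    return A, c, omega


# Hypothetical factor from a two-unknown problem.
A, c, omega = rebuild_cross_products([[4, 6], [0, 5]], [16, 11], F(9, 5))
print(A, c, omega)
```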

6. The Computation of Variances

Writing  in  1821,  Gauss  [4]  summarized  his  and  Laplace’s 
justifications  of  least  squares  as  follows. 

From the foregoing we see that the two justifications
each leave something to be desired. The
first depends entirely on the hypothetical form
of the probability of the error; as soon as that
form is rejected, the values of the unknowns
produced by the method of least squares are
no more the most probable values than is the
arithmetic mean in the simplest case mentioned
above. The second justification leaves us entirely
in the dark about what to do when the
number of observations is not large. In this
case the method of least squares no longer has
the status of a law ordained by the probability
calculus and has only the simplicity of the
operations it entails to recommend it.

In the Pars Prior of his memoir Theoria Combinationis
Observationum Erroribus Minimis Obnoxiae [7], Gauss
resolves the dilemma by introducing the notion of mean
square error as a measure of variation and showing that,
among all linear combinations of the observations that
produce exact estimates in the absence of error, the least
squares estimates have least mean square error.

In the Pars Posterior of the Theoria Combinationis
[6], Gauss addresses the problem of computing variances.
He points out that his elimination method gives only the
variance of the last unknown. Since (he continues) a general
elimination to invert the normal equations is expensive,
some calculators have adopted the practice of performing
the elimination with another unknown placed
last.⁷ Gauss says that he will give a better way.

Gauss actually gives two solutions to the problem. In
the first he shows that if one inverts the system $Rb = s$
to get $Ts = b$, then the matrix $V$ obtained by passing
from (3.3) to (3.4) can be written

$$V = TDT^T.$$

Thus the diagonal elements of $V$ can be computed as a
weighted sum of squares of the rows of $T$. Gauss gives
two algorithms for computing $T$, one of them particularly
advantageous when only a few variances are to be
computed.
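A minimal sketch of this first method follows. It inverts the triangular factor column by column, which is a generic approach and not necessarily either of Gauss's two algorithms, and then forms the weighted sums of squares. The function names and the sample `R` are hypothetical.

```python
from fractions import Fraction as F


def invert_upper_triangular(R):
    """Return T with R T = I, found column by column by back substitution."""
    p = len(R)
    T = [[F(0)] * p for _ in range(p)]
    for col in range(p):
        for i in range(col, -1, -1):
            rhs = F(1) if i == col else F(0)
            rhs -= sum(F(R[i][j]) * T[j][col] for j in range(i + 1, col + 1))
            T[i][col] = rhs / R[i][i]
    return T


def diagonal_of_V(R):
    """v_ii = sum_j t_ij^2 r_jj, the diagonal of V = T D T^T."""
    p = len(R)
    T = invert_upper_triangular(R)
    return [sum(T[i][j] ** 2 * R[j][j] for j in range(p)) for i in range(p)]


# Hypothetical factor; here V is the inverse of A = [[4, 6], [6, 14]].
R = [[4, 6], [0, 5]]
print(invert_upper_triangular(R), diagonal_of_V(R))
```

In this example the last diagonal entry equals the reciprocal of the last pivot, in line with the earlier statement about the $(p,p)$-element of $V$.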

The second method is a very general result for computing
the variance of an arbitrary linear combination

$$t = g^T b + \kappa$$

of the unknowns $b$. Specifically, if we pass from the
variables $b$ to the variables $u$, so that $t$ assumes the
form

$$t = h^T u + \hat{t},$$

then $\hat{t}$ is the value of $t$ at the least squares estimates of
the unknowns,⁸ and its variance is proportional to

$$h^T D h.$$

Moreover, $h$ may be obtained by solving the triangular
system

$$R^T h = g.$$

Thus Gauss reduces the problem of computing a variance
to that of solving a triangular system.
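The second method can be sketched as a forward substitution followed by a weighted sum of squares. In the Python fragment below the function names, the factor `R`, and the coefficient vector `g` are hypothetical; the result is the factor to which the variance is proportional, not the variance itself.

```python
from fractions import Fraction as F


def solve_transposed(R, g):
    """Solve R^T h = g by forward substitution (R is upper triangular)."""
    p = len(g)
    h = [F(0)] * p
    for i in range(p):
        known = sum(F(R[j][i]) * h[j] for j in range(i))
        h[i] = (F(g[i]) - known) / R[i][i]
    return h


def variance_factor(R, g):
    """Return h^T D h, where D = diag(r_11, ..., r_pp) and R^T h = g."""
    h = solve_transposed(R, g)
    return sum(hi * hi * R[i][i] for i, hi in enumerate(h))


# Hypothetical example: t = b1 + 2 b2 with the factor R below.
R = [[4, 6], [0, 5]]
g = [1, 2]
print(solve_transposed(R, g), variance_factor(R, g))
```

For this example the result agrees with $g^T V g$ computed from the full inverse of $A = [[4, 6], [6, 14]]$, but it is obtained without forming $V$.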

A modern practice in numerical linear algebra is to
compute a matrix decomposition and then use it in a
variety of computations. Although it would be anachronistic
to call Gauss a decompositionalist, he calculated
like one. The results of his elimination serve as a computational
platform from which both estimates and variances
can be obtained.

7.  Computational  Complexity 

Did Gaussian elimination represent an improvement over
the practices of the day? If we assume that people were
using Gauss-Jordan elimination to solve systems, they
would have performed roughly $\frac{1}{2}p^3$ multiplications and
about the same number of additions. Gaussian elimination,
on the other hand, requires about $\frac{1}{6}p^3$ multiplications
and additions. Thus Gaussian elimination represents
an improvement of a factor of about three.

⁷Laplace, for example, recommended a similar procedure in the
first supplement to his Théorie Analytique des Probabilités [18].

⁸It has been asserted [22] that Gauss established that $\hat{t}$ enjoyed
the same minimum variance properties as the components of $b$.
Although the result is true, Gauss never proved it.

If variances are required, the inversion of the normal
equations by Gauss-Jordan elimination would cost an
additional multiplications and additions for a total
of |p^. With Gauss's approach the total is |p^, an
improvement by a factor |.

In an age in which a workstation can solve a system
of order 100 with barely a hiccup, it is easy to be
cavalier about factors of three. To see what it might have
meant  to  people  who  had  to  do  their  calculations  by 
hand,  consider  the  following  quote  from  A  Treatise  on 
the  Adjustment  of  Observations  published  in  1884  by  T. 
W.  Wright  [24,  p.  173]: 

Dr. Hügel, of Hessen, Germany, states that he
has  solved  10  normal  equations  in  from  10-12 
hours,  using  a  log.  table,  but  that  29  equations 
took  him  seven  weeks. 

Without Gaussian elimination Dr. Hügel's twelve hours
would  have  stretched  to  a  day  and  a  half,  and  his  seven 
weeks  to  almost  half  a  year. 


8.  Notation 

Gauss, like most mathematicians of his time, made sparing
use of subscripts and superscripts, preferring to use
primes or sequences of letters to distinguish variables.
For  example,  Gauss  writes  his  linear  model  in  the  form 

$$v = ax + by + cz + \text{etc.} + l$$

$$v' = a'x + b'y + c'z + \text{etc.} + l'$$

$$v'' = a''x + b''y + c''z + \text{etc.} + l'' \quad \text{etc.}$$

Here x, y, z, etc. are the unknowns we have been denoting
by $b_i$ and the v's are the errors. Although this
expansive  notation  appears  awkward  to  us,  in  Gauss’s 
hands  it  could  be  quite  expressive.  For  example,  here 
(slightly  edited)  is  how  he  writes  the  normal  equations. 

$$0 = [aa]x + [ab]y + [ac]z + \text{etc.} + [al]$$

$$0 = [ab]x + [bb]y + [bc]z + \text{etc.} + [bl]$$

$$0 = [ac]x + [bc]y + [cc]z + \text{etc.} + [cl] \quad \text{etc.}$$

Note the elegant way in which the notation [ab] suggests
a sum of products from the a and b columns.

Gauss’s  notation  for  elimination  is  equally  well  con¬ 
sidered.  The  following  is  from  the  Supplementum  [8]  to 


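the Theoria Combinationis:

$$[bb,1] = [bb] - \frac{[ab]^2}{[aa]}$$

$$[bc,1] = [bc] - \frac{[ab][ac]}{[aa]}$$

etc.

$$[cc,2] = [cc] - \frac{[ac]^2}{[aa]} - \frac{[bc,1]^2}{[bb,1]}$$

$$[cd,2] = [cd] - \frac{[ac][ad]}{[aa]} - \frac{[bc,1][bd,1]}{[bb,1]}$$

etc.

$$[dd,3] = [dd] - \frac{[ad]^2}{[aa]} - \frac{[bd,1]^2}{[bb,1]} - \frac{[cd,2]^2}{[cc,2]}$$
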

Here  as  above,  a  pair  of  letters  indicates  the  position  in 
the  normal  equations.  The  appended  numerals  indicate 
the level of elimination. Incidentally, this seems to be
the  first  appearance  of  the  inner  product  form  of  the 
algorithm,  in  which  the  matrix  R  is  generated  row  by 
row.  It  is  the  preferred  form  for  hand  calculation,  since 
one  need  only  record  an  array  of  numbers. 
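
The row-by-row scheme can be sketched as follows. This is an illustration under stated assumptions, not Gauss's own layout: the matrix A stands for the array of bracket sums ([aa], [ab], ...), index 0 plays the role of the letter a, and the function name is hypothetical. Each entry of row i is the original bracket minus one correction term per earlier row, which is exactly the pattern [cd,2] = [cd] - [ac][ad]/[aa] - [bc,1][bd,1]/[bb,1].

```python
def gauss_inner_product_form(A):
    """Generate the upper triangular array R row by row.

    A is the symmetric array of normal-equation coefficients. R[i][j]
    corresponds to the bracket for letters (i, j) at elimination level i,
    e.g. R[1][2] is [bc,1] and R[2][3] is [cd,2].
    """
    p = len(A)
    R = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i, p):
            # Subtract one product term for each earlier level k < i.
            correction = sum(R[k][i] * R[k][j] / R[k][k] for k in range(i))
            R[i][j] = A[i][j] - correction
    return R


# Illustrative bracket sums (an assumption, chosen to give round numbers).
A = [[4.0, 2.0, 2.0],
     [2.0, 5.0, 3.0],
     [2.0, 3.0, 6.0]]
for row in gauss_inner_product_form(A):
    print(row)
```

Only finished entries of R are ever written down, which is why this form suits hand calculation: no intermediate reduced matrices need to be recorded.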


9.  Legacy 

The  casting  of  Gauss’s  results  in  matrix  notation  in  some 
sense  trivializes  them.  With  our  knowledge  of  matrix 
algebra,  we  can  leap  ahead  to  results  that  researchers 
of  Gauss’s  time  could  only  arrive  at  by  more  pedestrian 
routes.  Yet  we  must  be  careful  not  to  be  patronizing. 
Gauss  and  his  successors  accomplished  a  great  deal  with 
their  techniques  and  notation. 

For  example,  Gauss’s  presentation  of  his  algorithm  as 
elimination  in  a  quadratic  form  strikes  us  as  unusual  to¬ 
day.  Yet  it  was  the  first  of  many  reductions  of  quadratic 
and  bilinear  forms  that  later  became  our  familiar  matrix 
decompositions,  including  among  others  the  LU  decom¬ 
position,  the  Jordan  canonical  form,  and  the  singular 
value  decomposition.  As  Kline  points  out  in  his  book 
Mathematical Thought from Ancient to Modern Times
[13,  Ch.33],  by  the  time  the  use  of  matrices  had  become 
widespread,  many  of  the  principal  results  of  matrix  the¬ 
ory  had  already  been  established. 

Gauss’s  algorithms,  written  in  his  notation,  sur¬ 
vived  into  the  twentieth  century,  especially  in  books  on 
geodesy.  Thereafter,  as  people  began  to  use  present-day 
notation,  his  contributions  became  less  visible.  By  1959, 
when  I  first  began  working  with  computers,  Gaussian 
elimination  had  come  to  mean  any  triangularization  of 
a system of equations, symmetric or nonsymmetric, followed
by a back substitution, and none of us had an idea
of  what  Gauss  had  actually  done. 

Yet  what  he  did  is  worth  recalling.  Gauss  worked  with 
real-life  problems  and  got  his  hands  dirty  solving  them. 
He  always  looked  for  the  best,  most  efficient  algorithm; 
and when he had it, he expressed it in a clean notation
that suggested how to use it. These virtues are no less
important today than in Gauss's time.
Abstract

A  simple  algorithm  for  estimating  the  regression  func¬ 
tion  over  the  United  States  is  introduced.  The  approach 
allows  for  data  obtained  from  a  complicated  sampling 
design,  as  well  as  for  the  inclusion  of  a  few  additional 
covariates.  The  regression  estimates  are  obtained  from 
an  associated  probability  density  estimate,  namely  the 
averaged  shifted  histogram.  The  algorithm  has  proven 
especially  successful  over  a  large  mesh,  say  300  by  200 
nodes,  in  a  data  rich  setting,  even  on  a  486  computer 
running  Splus.  Commonly  available  alternative  codes 
including  kriging  failed  to  produce  useful  estimates  in 
this  setting. 

1.  Introduction 

The  problem  of  nonparametric  regression  has  at¬ 
tracted  a  wealth  of  attention  since  the  pioneering  pa¬ 
pers  of  Nadaraya  (1964)  and  Watson  (1964);  see  Eu¬ 
bank  (1988)  and  Hardle  (1990).  Available  algorithms 
range  from  the  simple  running  median,  to  variational 
formulations  giving  rise  to  spline  estimates,  to  kernel  es¬ 
timates,  and  finally  local  polynomial  fitting.  There  has 
been  a  great  deal  of  recent  discussion  about  the  right  and 
wrong  way  to  do  nonparametric  regression.  Some  have 
argued  for  the  elegance  of  splines,  while  others  find  the 
local  polynomial  approach  compelling,  but  some  argue 
for  one’s  personal  preference. 

From  our  experience  in  the  density  estimation  setting, 
we  find  that  direct  methods  work  well  in  1  to  5  dimen¬ 
sions,  but  even  in  3-5  dimensions,  the  size  of  the  meshes 
is  growing  exponentially,  and  sufficient  data  often  aren’t 
available.  In  the  regression  setting,  we  find  that  the  dis¬ 
cussion  in  the  literature  has  focused  too  heavily  on  rel¬ 
atively  small  1  and  2  dimensional  data  sets  where  most 
methods  perform  reasonably  well.  In  this  manuscript, 
we  consider  a  more  realistic  and  stressful  problem  deal¬ 
ing  with  farm  data  such  as  that  routinely  surveyed  by 
the  U.S.D.A.  These  surveys  result  in  very  large  databases 
over nonuniform spatial meshes (see Figure 1), complicated
by nonuniform weighting schemes as well as interest
in several covariates.

Large data sets and/or large mesh sizes result in practical
problems. Too many regression methods (splines, kernels,
etc.) have solutions or algorithms whose exact form is
determined by the number of data points, which makes
computation infeasible even on 486 level computers.
The key to computational efficiency is the same as
for  density  estimation:  binning  the  multivariate  data 
(Scott,  1992;  Hardle  and  Scott,  1992;  Fan  and  Marron, 
1994). 

Beyond  4  or  5  dimensions,  direct  mesh  methods  of  any 
kind  encounter  practical  difficulties  resulting  from  the 
curse  of  dimensionality.  Some  form  of  advanced  projec¬ 
tion  technology  or  additive  modeling  has  proven  useful 
(Hastie and Tibshirani, 1990).

However, "real data" can throw a curve at the best
planned  evaluation  of  even  carefully  constructed  algo¬ 
rithms.  We  have  mentioned  the  special  problem  of  large 
samples.  Here  we  would  like  to  focus  on  problems  re¬ 
sulting  from  a  mixture  of  spatial  and  continuous  vari¬ 
ables.  They  are:  (1)  irregular  boundary  definition,  (2) 
data  collected  by  a  sampling  design,  and  (3)  a  very  large 
mesh  required  to  have  high  spatial  resolution.  In  princi¬ 
ple,  an  exact  irregular  boundary  scheme  can  be  handled 
(perhaps  with  great  programming  effort),  and  weighting 
can  be  introduced  into  the  estimation  phase.  However, 
many  simple-minded  implementations  run  into  numeri¬ 
cal  instabilities  with  large  meshes. 

We wish to show how simply the binned methods
(specifically the ASH or WARP algorithms) can be modified
to handle such data, even with very fine 300 x 200
spatial meshes, on a 486 level machine.

We  find  that  the  common  focus  on  boundary  behavior 
is  only  a  minor  part  of  our  thinking.  Firstly,  we  are  deal¬ 
ing  with  large  samples  and  thus  only  a  relatively  small 
bandwidth  is  required.  (By  way  of  contrast,  many  sim¬ 
ulation  examples  involve  n  =  100  1-dimensional  data 
where  the  bandwidth  may  span  1/4-1/2  of  the  data 
interval, making boundary conditions dominant.) Secondly,
for mapping purposes, we find the boundary effects
and corrections of little practical importance towards
understanding and summarizing our data
exploration/presentation efforts.

Ironically,  we  have  found  “internal  boundary”  situa¬ 
tions  more  of  a  practical  nuisance.  These  occur  in  ar¬ 
eas  internal  to  the  USA,  say,  where  there  are  no  data 
(because  there  is  no  agriculture),  inducing  a  boundary 
effect  caused  by  sparseness  rather  than  a  physical  exter¬ 
nal  boundary.  We  identify  this  situation  by  observing 
how  low  the  density  falls  in  each  region  where  we  are 
evaluating  the  regression  function  (i.e.,  how  close  to  0 
is  the  denominator?).  This  is  a  multivariate  version  of 
the  well-known  practical  problem  of  “extrapolation”  of 
regression  estimates  beyond  the  support  of  the  data. 

In our experience, many off-the-shelf kriging or regression
programs cannot handle large rectangular meshes of
300 x 200 points covering a Mercator projection of the
lower 48 states. Rewriting such codes is always a possibility,
but we have found that the simple ASH ideas provide
excellent estimates and dramatic correlation with
actual photographic evidence. Carr (1990) has used raw
(hexagonal  histogram)  bivariate  binning  techniques.  We 
are  interested  in  providing  some  additional  smoothing 
(that  will  provide  improved  estimation  quality)  as  well 
as  handling  additional  covariates. 


2.  Algorithm  Motivation 

We start with a simple description of the ideas and
algorithms for handling $(x, y, z)$ data where $(x, y)$
represents the center of one of our bivariate bins
(approximately 10 miles by 10 miles) containing one or more
U.S.D.A. sampling units. The variable $z$ represents the
quantity of interest; for example, total farm income or
the fraction of Federal dollars in farm income. We seek to
estimate $E[Z(x, y)]$ or $z(x, y)$ in areas where $f(x, y) > 0$.

2.1.  Kernel  Regression  Estimation 

Let $K$ be a symmetric kernel function with support
on $(-1, 1)$ satisfying $\int K(t)\,dt = 1$. Given a positive
smoothing parameter $h$, define the scaled kernel function
by

$$K_h(t) = \frac{1}{h} K\!\left(\frac{t}{h}\right). \qquad (1)$$

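We take as a starting point the well-known result (Scott,
1992) that the Nadaraya-Watson bivariate regression
estimator

$$\hat{m}(x,y) = \frac{\sum_{i=1}^{n} z_i K_h(x - x_i) K_h(y - y_i)}{\sum_{i=1}^{n} K_h(x - x_i) K_h(y - y_i)}$$

is the exact result of the computation

$$\hat{m}(x,y) = \int z \hat{f}(z \mid x,y)\,dz = \frac{\int z \hat{f}(x,y,z)\,dz}{\int \hat{f}(x,y,z)\,dz},$$

where the trivariate product kernel density estimator is
given by

$$\hat{f}(x,y,z) = \frac{1}{n} \sum_{i=1}^{n} K_h(x - x_i) K_h(y - y_i) K_h(z - z_i).$$

Clearly

$$\int \hat{f}(x,y,z)\,dz = \frac{1}{n} \sum_{i=1}^{n} K_h(x - x_i) K_h(y - y_i),$$

since $\int K_h(z - z_i)\,dz = \int K_h(z)\,dz = 1$.
Also, $\int z \hat{f}(x,y,z)\,dz = \frac{1}{n} \sum_{i=1}^{n} z_i K_h(x - x_i) K_h(y - y_i)$,
since

$$\int z K_h(z - z_i)\,dz = \int (z + z_i) K_h(z)\,dz = z_i,$$

recalling that $\int z K_h(z)\,dz = 0$ (by symmetry).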

Clearly, different smoothing parameters $h_x$, $h_y$, $h_z$
could be chosen for each dimension. Interestingly, the
particular choice of $h_z$ has no effect on the regression
estimate!

It is well-known (Hardle, 1990) that local polynomial
regression (LPR) estimators and spline methods have
equivalent kernel forms. LPR does have the advantage
that the kernel adjusts properly at the boundary to
reduce bias (Fan, 1992).

However, the practical gain of the bias correction is
often small, as $f(x) \to 0$ near the boundary and/or
$m(x) \to 0$ near the boundary. Many authors consider
only cases where $f(x)$ is nearly constant over a finite
interval, or even the simplest case of a fixed equally-spaced
mesh. These situations tend to accentuate boundary
concerns and problems.
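
For concreteness, the direct (unbinned) bivariate Nadaraya-Watson estimator can be sketched as below. The biweight kernel and the function names are assumptions made for illustration; the text only requires a symmetric kernel on (-1, 1) that integrates to 1. Note that no bandwidth for z appears, in line with the remark that the choice of h_z has no effect.

```python
def biweight(t):
    """Biweight kernel: symmetric, supported on (-1, 1), integrates to 1."""
    return 0.9375 * (1.0 - t * t) ** 2 if abs(t) < 1.0 else 0.0


def scaled_kernel(t, h):
    """Scaled kernel K_h(t) = K(t / h) / h."""
    return biweight(t / h) / h


def nadaraya_watson(x, y, data, hx, hy):
    """Kernel-weighted average of z at (x, y).

    data is a list of (x_i, y_i, z_i) triples. Returns None where the
    denominator vanishes, i.e. where no data fall within the bandwidths.
    """
    num = 0.0
    den = 0.0
    for xi, yi, zi in data:
        w = scaled_kernel(x - xi, hx) * scaled_kernel(y - yi, hy)
        num += zi * w
        den += w
    return num / den if den > 0.0 else None


# Two illustrative points placed symmetrically about the origin.
data = [(-0.5, 0.0, 1.0), (0.5, 0.0, 3.0)]
print(nadaraya_watson(0.0, 0.0, data, 1.0, 1.0))  # midway between 1 and 3
print(nadaraya_watson(5.0, 5.0, data, 1.0, 1.0))  # None: the 0/0 case
```

Every evaluation loops over all n data points, which is the cost the binned methods are meant to avoid on a large mesh.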

2.2.  ASH  Density  Algorithm 

We mimic the simple Nadaraya-Watson idea except on
a more computationally oriented estimator, the averaged
shifted histogram (ASH), introduced by Scott (1983,
1985, 1992). We remotivate the multivariate ASH.

Let us slightly alter our notation so that

$$\{x_1, x_2, \ldots, x_{n_x};\ y_1, y_2, \ldots, y_{n_y};\ z_1, z_2, \ldots, z_{n_z}\}$$

are the midpoints along each axis of a trivariate mesh of
size $n_x \times n_y \times n_z$ with spacings $\delta_x, \delta_y, \delta_z$. Thus

$$\Delta x_i = \delta_x = \frac{h_x}{m_x}, \qquad \Delta y_i = \delta_y = \frac{h_y}{m_y}, \qquad \Delta z_i = \delta_z = \frac{h_z}{m_z}$$

for some integers $m_x, m_y, m_z$ and smoothing parameters
$h_x, h_y, h_z$.

Let $\nu_{jkl}$ denote the number of data points $(x,y,z)_i$
falling in bin $B_{jkl}$. Note that $\sum \nu_{jkl} = n$, and we expect
many of the $\nu_{jkl}$ to be 0.

The "naive ASH" is constructed by "computing"
$m_x \times m_y \times m_z$ (different) trivariate histograms, each with
rectangular bin size $h_x \times h_y \times h_z$, but with origins shifted by
multiples of $\delta_x, \delta_y, \delta_z$ along the coordinate axes. To be
specific, one bin is anchored at the point $(j\delta_x, k\delta_y, l\delta_z)$,
as $j, k, l$ each range from 0 to $m_x - 1$, $m_y - 1$, $m_z - 1$.

Scott (1985) showed that this was a special case of a
general weighting scheme:

$$\hat{f}_{jkl} = \hat{f}(x_j, y_k, z_l) = \frac{1}{n \delta_x \delta_y \delta_z} \sum_{a,b,c} w_{abc}\, \nu_{j+a,k+b,l+c},$$

where the sums range over $-m_x < a < m_x$, $-m_y < b < m_y$,
and $-m_z < c < m_z$, and

$$w_{abc} \propto K\!\left(\frac{a}{m_x}\right) K\!\left(\frac{b}{m_y}\right) K\!\left(\frac{c}{m_z}\right), \qquad \sum_{a,b,c} w_{abc} = 1,$$

where $K$ is supported on $(-1, 1)$ as before. Note that in
an obvious notation, $w_{abc} = w_a w_b w_c$. This is a classic
discretization scheme. The weights $\{w_a, w_b, w_c\}$ need
only be computed once.

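We first verify that the trivariate ASH is indeed a density
function. Clearly it is nonnegative. To prove that it
has integral 1, we compute

$$\begin{aligned}
\iiint \hat{f}(x,y,z)\,dx\,dy\,dz &= \delta_x \delta_y \delta_z \sum_j \sum_k \sum_l \hat{f}_{jkl} \\
&= \frac{1}{n} \sum_j \sum_k \sum_l \sum_a \sum_b \sum_c w_{abc}\, \nu_{j+a,k+b,l+c} \\
&= \frac{1}{n} \sum_a \sum_b \sum_c w_{abc} \sum_j \sum_k \sum_l \nu_{j+a,k+b,l+c} \\
&= \sum_a \sum_b \sum_c w_{abc} = 1,
\end{aligned}$$

assuming a buffer of 0's around the edges of the $\{\nu_{jkl}\}$
array, so that

$$\frac{1}{n} \sum_j \sum_k \sum_l \nu_{j+a,k+b,l+c} = 1 \quad \text{for all } a, b, c.$$

In practice, the array $\{\hat{f}_{jkl}\}$ is initialized to all 0's, and
then the influence of every bin $B_{jkl}$ for which $\nu_{jkl} > 0$ is
added to the appropriate subset of $\hat{f}_{jkl}$.

We could define $\hat{f}(x,y,z)$ to be a spline surface
interpolated from the above array, but for simplicity, we
take it to be constant over each bin $B_{jkl}$ and assume it
vanishes outside the mesh; that is, $\hat{f}(x,y,z) = 0$ there.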
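
The "initialize to zero, then spread each nonempty bin" step can be sketched in one dimension as follows. This is a simplified illustration, not the authors' Splus code: it is univariate rather than trivariate, the triangle kernel and all function names are assumptions, and the weights are normalized to sum to 1 so that the estimate integrates to 1 when the mesh has a buffer of empty bins at each edge.

```python
import math


def triangle(t):
    """Triangle kernel on (-1, 1)."""
    return max(0.0, 1.0 - abs(t))


def ash_weights(m, kernel=triangle):
    """Weights w_a for -m < a < m, proportional to K(a/m) and summing to 1."""
    raw = {a: kernel(a / m) for a in range(1 - m, m)}
    total = sum(raw.values())
    return {a: v / total for a, v in raw.items()}


def ash_1d(data, origin, delta, nbins, m, kernel=triangle):
    """Univariate ASH evaluated on a mesh of nbins bins of width delta."""
    counts = [0] * nbins
    for x in data:
        j = math.floor((x - origin) / delta)
        if 0 <= j < nbins:
            counts[j] += 1
    n = sum(counts)
    w = ash_weights(m, kernel)
    f = [0.0] * nbins  # initialized to all 0's
    for j, nu in enumerate(counts):
        if nu == 0:
            continue  # empty bins contribute nothing
        for a, wa in w.items():
            if 0 <= j - a < nbins:
                # bin j is the (j - a) + a neighbour of bin j - a
                f[j - a] += wa * nu / (n * delta)
    return f


data = [0.52, 0.58, 0.91, 1.43, 1.47]
f = ash_1d(data, origin=0.0, delta=0.1, nbins=20, m=3)
print(sum(f) * 0.1)  # total mass of the piecewise-constant estimate
```

The work is proportional to the number of nonempty bins times the number of weights, not to the number of data points times the mesh size.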


2.3.  ASH  Regression  Algorithm 

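Following the Nadaraya-Watson motivation, the ASH
regression estimator is found by computing

$$\hat{m}_{jk} = \hat{m}(x_j, y_k) = E(Z \mid X = x_j, Y = y_k) = \int z \hat{f}(z \mid x_j, y_k)\,dz = \frac{\int z \hat{f}(x_j, y_k, z)\,dz}{\hat{f}(x_j, y_k)}.$$

The numerator can be computed by integrating bin by
bin along the z axis:

$$\sum_{l=1}^{n_z} \int_{z_l - \delta_z/2}^{z_l + \delta_z/2} z \hat{f}(x_j, y_k, z = z_l)\,dz = \sum_{l=1}^{n_z} \delta_z z_l \hat{f}(x_j, y_k, z_l),$$

since $\int z\,dz = \delta_z z_l$ for the limits given (recall $\hat{f}$ is
constant over each bin). Thus

$$\hat{m}_{jk} = \frac{\frac{1}{n \delta_x \delta_y \delta_z} \sum_{l=1}^{n_z} \delta_z z_l \sum_a \sum_b \sum_c w_{abc}\, \nu_{j+a,k+b,l+c}}{\frac{1}{n \delta_x \delta_y} \sum_a \sum_b w_{ab}\, \nu_{j+a,k+b}} = \frac{\sum_a \sum_b \sum_c w_{abc} \sum_{l=1}^{n_z} z_l\, \nu_{j+a,k+b,l+c}}{\sum_a \sum_b w_{ab}\, \nu_{j+a,k+b}}.$$

Now the final sum in the numerator can be computed by
observing that it is almost a conditional expectation:

$$\sum_{l=1}^{n_z} z_l\, \nu_{j+a,k+b,l+c} = \frac{\sum_{l=1}^{n_z} z_l\, \nu_{j+a,k+b,l+c}}{\nu_{j+a,k+b}}\, \nu_{j+a,k+b} \to \bar{z}_{j+a,k+b}\, \nu_{j+a,k+b}$$

as we let $m_z \to \infty$ (or equivalently let $\delta_z \to 0$ with $h_z$
fixed), where

$$\bar{z}_{ab} = \frac{1}{n_{ab}} \sum_{(x,y,z)_i \in B_{ab}} z_i$$

is the average of the $z_i$ over the $n_{ab}$ points falling in the
bivariate bin $B_{ab}$. Continuing, we note that $\sum_c w_c = 1$, so
that we finally arrive at the final form of the ASH
regression estimator

$$\hat{m}_{jk} = \frac{\sum_a \sum_b w_{ab}\, \bar{z}_{j+a,k+b}\, \nu_{j+a,k+b}}{\sum_a \sum_b w_{ab}\, \nu_{j+a,k+b}}.$$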
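
A minimal sketch of the final form on a bivariate mesh is given below. It is an illustration under assumptions, not the authors' implementation: the triangle kernel, the function names, and the argument layout are all hypothetical. The product of the bin mean and the bin count is simply the sum of the z values in that bin, so the code accumulates bin counts and bin sums and then spreads each nonempty bin onto its neighbours with the product weights.

```python
import math


def triangle(t):
    """Triangle kernel on (-1, 1)."""
    return max(0.0, 1.0 - abs(t))


def axis_weights(m, kernel=triangle):
    """Weights w_a for -m < a < m, proportional to K(a/m) and summing to 1."""
    raw = {a: kernel(a / m) for a in range(1 - m, m)}
    total = sum(raw.values())
    return {a: v / total for a, v in raw.items()}


def ash_regression(points, x0, y0, dx, dy, nx, ny, mx, my):
    """ASH regression on an nx-by-ny mesh; None marks the 0/0 regions.

    points holds (x, y, z) triples; (x0, y0) is the lower-left mesh
    corner and dx, dy are the narrow bin widths.
    """
    count = [[0.0] * ny for _ in range(nx)]  # bin counts
    zsum = [[0.0] * ny for _ in range(nx)]   # bin sums of z (mean times count)
    for x, y, z in points:
        j = math.floor((x - x0) / dx)
        k = math.floor((y - y0) / dy)
        if 0 <= j < nx and 0 <= k < ny:
            count[j][k] += 1.0
            zsum[j][k] += z
    wx, wy = axis_weights(mx), axis_weights(my)
    num = [[0.0] * ny for _ in range(nx)]
    den = [[0.0] * ny for _ in range(nx)]
    for j in range(nx):
        for k in range(ny):
            if count[j][k] == 0.0:
                continue  # only nonempty bins have any influence
            for a, wa in wx.items():
                for b, wb in wy.items():
                    jj, kk = j - a, k - b
                    if 0 <= jj < nx and 0 <= kk < ny:
                        num[jj][kk] += wa * wb * zsum[j][k]
                        den[jj][kk] += wa * wb * count[j][k]
    return [[num[j][k] / den[j][k] if den[j][k] > 0.0 else None
             for k in range(ny)] for j in range(nx)]


pts = [(0.55, 0.55, 1.0), (0.75, 0.55, 3.0)]
grid = ash_regression(pts, 0.0, 0.0, 0.1, 0.1, 12, 12, 3, 3)
print(grid[6][5], grid[5][5], grid[0][0])
```

Mesh nodes that no nonempty bin reaches keep a zero denominator and are reported as None rather than being filled by extrapolation.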

2.4. ASH Regression Extensions

REMARK 1: For the survey sampled data, each data
point takes the extended form

$$\{(x, y, z, \alpha)_i,\ i = 1, \ldots, n\},$$

where $\alpha_i$ is the effective sampling weight. Previously,
we have assumed that $\alpha_i = 1$ for all cases. Here, the
frequency counts $\nu_{jkl}$ are replaced by the sum of these
$\alpha_i$ weights rather than 1's.
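
In code, the change amounts to adding the sampling weight instead of 1 when a point is binned. The sketch below is an assumption-laden illustration (hypothetical function name and argument layout, bivariate bins only): it returns the weighted bin totals and the weighted bin sums of z, which then take the place of the counts and sums in the regression ratio.

```python
import math


def bin_weighted(points, x0, y0, dx, dy, nx, ny):
    """Bin (x, y, z, alpha) records on an nx-by-ny mesh.

    Returns (nu, zsum): nu[j][k] is the sum of the sampling weights in
    bin (j, k) and zsum[j][k] is the sum of alpha * z in that bin.
    """
    nu = [[0.0] * ny for _ in range(nx)]
    zsum = [[0.0] * ny for _ in range(nx)]
    for x, y, z, alpha in points:
        j = math.floor((x - x0) / dx)
        k = math.floor((y - y0) / dy)
        if 0 <= j < nx and 0 <= k < ny:
            nu[j][k] += alpha          # weight replaces the unit count
            zsum[j][k] += alpha * z
    return nu, zsum


# Two records in the same bin with unequal sampling weights.
records = [(0.25, 0.25, 1.0, 2.0), (0.27, 0.21, 4.0, 1.0)]
nu, zsum = bin_weighted(records, 0.0, 0.0, 0.1, 0.1, 5, 5)
print(nu[2][2], zsum[2][2], zsum[2][2] / nu[2][2])
```

With all weights equal to 1 this reduces to ordinary counting, so the unweighted estimator is recovered as a special case.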


G.  Whittiker  and  D.  W.  Scott  1 1 


REMARK 2: Occasionally, our data will include other
covariates and be of the form

$$\{(x, y, z, t, \alpha)_i,\ i = 1, \ldots, n\},$$

where $t$ is some covariate of interest. Then we compute
the ASH regression estimator $\hat{m}(x, y, t)$ by simply
adding another loop to the numerator and denominator
of the $\hat{m}_{jk}$ equation above. The sampling weights are
the same of course. What could be easier? Typically, we
will map the estimate at several levels of $t$, for example,

$$\hat{m}(x, y, t = t_0).$$

REMARK  3:  The  1-dimensional  ASH  regression  pre¬ 
scription  was  first  published  in  Hardle  and  Scott  (1992) 
under  the  name  WARPing. 

3.  Mapping  Details 

After the "usa()" is plotted, the regression ASH is
computed over the entire 300 x 200 mesh and added to
the figure by using either the Splus "contour" or "image"
function and the argument "add=T". Typically, the contour
lines will extend slightly outside the US borders. A
simple  trick  removes  those  lines,  by  applying  “polygon” 
to  two  pieces  that  outline  half  the  borders  of  the  US  and 
the  surrounding  rectangles.  This  will  be  illustrated  in 
the  examples. 

The  internal  boundary  solution  is  not  handled  in  an 
elegant  fashion  currently.  Thresholding  could  be  applied, 
but  we  find  the  problem  is  relatively  localized  and  have 
left  it  for  the  reader  to  discover.  A  bootstrap  algorithm 
has  been  implemented  to  estimate  the  pointwise  error. 
We  have  used  this  to  replace  or  delete  regions  where  the 
estimator  behaves  erratically. 

4. Examples

The  “real”  data  considered  in  this  section  come  from 
the  Farm  Costs  and  Returns  Survey.  This  is  a  stratified 
complex  design  survey  which  is  used  to  measure  finances 
and  production  of  all  U.S.  agriculture.  The  weight  of 
each  observation  was  taken  to  be  the  inverse  of  the  prob¬ 
ability  of  selection.  We  begin  with  a  small  bivariate  sim¬ 
ulation. 

4.1.  A  Simulation  Example 

A surface with 3 bumps typical of those encountered
in USDA work was constructed on a 50 x 50 mesh (not
shown). The surface was contaminated twice: first with
Gaussian noise and then with Cauchy noise. From this
complete set of 2,500 points, 200 points were selected at
random. The estimated ASH regression surface was computed
with $m_x = m_y = 5$. The trimodal structure was
evident, but then so were some spurious peaks induced
by the Cauchy noise. Clearly, the raw ASH algorithm
has no robustness component included.

We  next  applied  the  loess  (Cleveland,  1979)  Splus 
function  to  these  data.  A  coplot  of  x  vs.  z  given  y  was 
computed  and  a  perspective  plot  of  the  entire  estimated 
surface  examined.  The  loess  surface  is  significantly  bet¬ 
ter  as  it  includes  iteration  to  provide  more  robust  an¬ 
swers  to  minimize  the  effects  of  the  Cauchy  noise. 

4.2.  Farm  Costs  and  Returns  Survey  Example 

A  sample  of  n  =  13,000  of  1.7  million  farms  was  drawn. 
For  these  data,  the  FIPS  code  for  each  observation  was 
known.  Thus  the  exact  location  of  each  observation  was 
assigned  to  the  location  of  the  population  centroid  of 
the  county  where  the  farm  is  located.  The  map  of  the 
3,100  centroids  is  shown  in  Figure  1.  Observe  that  the 
resolution  is  much  greater  east  of  the  Mississippi. 

When  loess,  kriging,  and  other  methods  were  applied 
to  these  data,  each  failed  to  produce  a  usable  surface 
from  the  data.  The  result  was  always  a  smooth  surface 
for  most  of  the  country  with  an  enormous  peak  at  an 
edge. However, the ASH regression algorithm with
$m_x = m_y = 5$ produced excellent results.

We  first  computed  the  estimate  without  using  the  sam¬ 
pling  weights  as  shown  in  Figure  2,  while  the  estimate 
with  sampling  weights  is  shown  in  Figure  3.  This  made  a 
big  difference,  particularly  in  areas  where  there  are  many 
observations  with  small  weights. 

As  mentioned  earlier,  internal  boundaries  can  cause 
problems  for  the  algorithm.  In  Figure  4,  we  zoom  in 
on one of the problem areas. The states of the four corners
region of the Southwest (Utah, Arizona, Colorado, and New
Mexico) join at about the location where this peak occurs.
The surface rises gradually to the peak, becomes a flat
plateau, then drops off a cliff to an area of no data (where
the regression estimator becomes 0/0). Use of zipcode
centroids and adaptive bandwidths might solve this.

Figure  5  captures  our  final  estimate  of  the  fraction  of 
government  payments  to  gross  farm  income.  Note  the 
contours  are  shown  on  a  logarithmic  scale.  The  bound¬ 
ary  artifact  in  the  four  corners  region  can  be  searched 
out.  Otherwise,  no  other  glaring  boundary  problems  ap¬ 
pear.  For  the  most  part,  the  value  of  the  regression  sur¬ 
face  is  quite  small  near  the  US  borders,  except  in  Texas 
and  along  a  portion  of  the  border  with  Canada  (where 
government  subsidies  are  even  greater!).  We  do  not  find 
the  bias  incurred  particularly  misleading. 

Next,  we  included  a  surrogate  variable  t  to  capture  the 
“size”  of  each  farm.  This  was  simply  the  total  sales.  We 
computed $\hat{m}(x, y, t)$ using the extended ASH algorithm
and  computed  2  slices — one  for  small  farms  (Figure  6) 
and  one  for  large  farms  (Figure  7).  The  highest  subsidies 
for small farms are concentrated primarily in the Midwest
and Plains states. For large farms, we see the rice
farms  along  the  Mississippi,  for  example.  These  patterns 
are  quite  interesting  to  policy  makers. 

4.3. Overlaying Maps

A popular exercise is overlaying different maps to capture
a relationship. Conventionally, this is done following
county boundaries. For example, Figure 8 displays
such data. The viewer is required to form a "mental surface"
or internal representation for these data. The ASH
algorithm does this for the viewer, with the added
advantages of consistency and the application of objective
statistical criteria to decide the contours of the surfaces.
In Figure 9, 4 shades are indicated on the map coming
from 2 ASH estimates. White areas indicate low activity
on both scales. The darkest shaded areas indicate where
both (1) farms are dependent on government payments
and (2) the geographical areas are highly dependent on
farm income. Such information is more easily gleaned
from these smooth ASH estimates.

5. Discussion

The  naive  ASH  is  not  robust,  but  is  easily  adapted  to 
handle  weighted  data  and  covariates  with  small  compu¬ 
tational  overhead.  Elegant  procedures  without  covariate 
handling  have  been  considered  by  Tobler  (1979).  We 
have  not  taken  advantage  of  possible  small  gains  avail¬ 
able  by  considering  spatial  correlations. 

However,  kriging  and  lowess  both  produced  estimates 
with  huge  values  at  the  boundary  and  outside  the  US 
borders.  Apparently,  the  trick  of  placing  a  rectangular 
grid  on  the  US  extending  outside  the  borders  fails  be¬ 
cause  the  algorithms  require  explicit  knowledge  of  the 
boundary  locations  as  input. 

The  actual  proximate  reason  for  failure,  interestingly 
enough,  is  due  to  the  “adaptive”  nature  of  these  algo¬ 
rithms,  which  fit  the  LPR  over  a  region  with  a  certain 
fraction  of  the  data.  In  places  where  the  mesh  extends 
offshore,  the  regression  estimate  is  reaching  far  inland 
for any data to fit: the extrapolation problem once
again. (Explicit boundary handling would fix this,
presumably.)

The  ASH  procedure  used  a  fixed  (or  nonadaptive) 
neighborhood.  The  result  is  regions  where  the  regres¬ 
sion  estimate  is  undefined  (0/0).  However,  we  are  more 
comfortable  with  such  undefined  regions  than  with  pro¬ 
viding  dubious  estimates  obtained  by  spanning  empty 
spaces. 
Figure 1. Population centroids of all U.S. counties.


Figure 2. ASH estimates with equal weights $\alpha_i = 1$. Note the
low values east of the Mississippi River.
Figure 5. Contours of the proportion of farm income from Federal estimated by the ASH.


Figure  6.  Conditional  distribution  of  the  variable  in  Figure  5  for  “small”  farms. 
Figure 7. Data as in Figure 6 but for "large" farms.


Figure 8. Data presented in the usual fashion, on a county-by-county basis.
Farm and Payment dependent
Farm  dependent 

Payment  dependent  ( >  0.1  of  gross) 


Figure  9.  Overlay  of  2  ASH  regression  estimates  (see  text). 




Fast  and  Stable  Computation  of  Local  Polynomials 

Burkhardt Seifert, Abteilung Biostatistik, Universität Zürich, Sumatrastrasse 30, CH-8006 Zürich

Michael Brockmann, Universität Heidelberg, Im Neuenheimer Feld 294, D-6900 Heidelberg

Joachim Engel, Wirtschaftstheoretische Abt. II

Theo Gasser, Abteilung Biostatistik

Abstract 

Naive implementations of local polynomial fits require
almost $O(n^2)$ operations. In this paper a fast $O(n)$
algorithm is presented. It is based on updating normal
equations. Numerical stability is guaranteed by centering
while moving, controlling ill-conditioned situations
for small bandwidths and data-tuned restarting of the
updating procedure. "Exact binning" and restarting at every
output point results in a moderately fast but highly
stable algorithm. Applicability of algorithms is
evaluated for estimation of regression curves and their
derivatives.

Some  key  words:  Fast  computation;  Local  polynomi¬ 
als;  Nonparametric  estimation;  Nonparametric  re¬ 
gression;  Smoothing;  Updating. 

AMS  1991  subject  classification.  Primary  65D10,  Sec¬ 
ondary  62G07,  65D25. 

1  Introduction 

Nonparametric  methods  of  curve  estimation  have  be¬ 
come  useful  techniques.  For  applications  fast  algorithms 
which  allow  computation  on  personal  computers  and  at 
the  same  time  guarantee  numerical  stability  are  highly 
desirable.  In  particular,  when  choosing  the  bandwidth 
from  the  data  or  in  bootstrapping  schemes,  multiple  eval¬ 
uations  of  the  estimators  become  necessary  and  a  fast 
algorithm  is  even  more  desirable.  Furthermore,  due  to 
technical  progress,  automatic  recording  of  mass  data  has 
become  easier.  This  puts  higher  demands  on  statistical 
algorithms. 

For  various  spline  based  regression  estimators  algo¬ 
rithms  have  been  developed  whose  number  of  arithmetic 
operations  grows  only  linearly  with  the  number  of  data 

^This work was part of the research program no. 21.-36042.92
of the Swiss NSF
points n (see de Boor, 1978; Utreras, 1980, 1981; Silverman,
1984; Hutchinson & de Hoog, 1985). In contrast, a
naive implementation of a kernel estimator for regression
or density estimation requires almost $O(n^2)$ operations.
Through averaging shifted histograms Scott (1985, 1986)
proposed a fast density estimator approximating a kernel
estimator which needs $O(n)$ operations. Hardle &
Scott (1992) extended this idea through their concept of
WARPING (weighted average of rounded points) to the
regression case where their estimator approximates the
Nadaraya-Watson kernel estimator. A fast algorithm
for an exact convolution type kernel regression was
suggested by Gasser & Kneip (1989). Seifert, Brockmann,
Engel & Gasser (1994) presented two fast $O(n)$ algorithms
and a highly stable but slightly slower $O(n^{7/5})$
version of the latter algorithm. The algorithms are
applicable to local polynomial regression and to kernel
estimation.
This  paper  is  based  on  Seifert  et  al.  (1994).  In  sec¬ 
tion  2  the  local  polynomial  regression  estimator  is  briefly 
discussed.  In  section  3  a  fast  algorithm  is  derived.  Its 
speed  is  based  on  updating  normal  equations  and  the 
idea  of  exact  binning.  Stability  is  obtained  by  several 
steps, centering while moving, control of ill-conditioned
matrices and data-tuned restart of the updating procedure
being the most important ones. Restarting at every
output point results in a moderately fast $O(n^{7/5})$
algorithm, which is even more stable than the conventional
one. A numerical evaluation is given in section 4 for
estimation of regression curves and their derivatives in fixed
and random designs.

2  Local  Polynomial  Regression 

Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be a set of independent and
identically distributed pairs of random variables where
the $X_i$ are scalar predictors and the $Y_i$ are scalar
responses. The developments of this paper can, however,
be generalized to higher-dimensional design.

In regression analysis a functional relationship between predictor and
response is assumed as

$$r(x) = E(Y \mid X = x). \qquad (1)$$

Predictors following a fixed design can be treated similarly. The
predictors are assumed to be sorted, $X_1 < \ldots < X_n$. The goal is
to estimate $r(x_0)$ or its $\nu$-th derivative
$r^{(\nu)}(x_0) = \frac{d^{\nu}}{dx^{\nu}} r(x) \big|_{x = x_0}$ for
some $\nu$. The local polynomial approach is based on the approximation

$$r(x) \approx \sum_{j=0}^{p} \frac{r^{(j)}(x_1)}{j!} \, (x - x_1)^j, \qquad (2)$$

provided $x$ is close to $x_1$, where $r$ is at least $(p + 1)$ times
differentiable. This representation suggests minimizing

$$\sum_{i=1}^{n} K\Bigl( \frac{X_i - x_0}{h} \Bigr) \Bigl( Y_i - \sum_{j=0}^{p} \beta_j (X_i - x_1)^j \Bigr)^2 \qquad (3)$$

with respect to $\beta = (\beta_0, \ldots, \beta_p)'$. Here $K$ denotes a
positive and symmetric weight function and $h$ is the bandwidth. Denote


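$$X = \begin{pmatrix} 1 & (X_1 - x_1) & \cdots & (X_1 - x_1)^p \\ \vdots & \vdots & & \vdots \\ 1 & (X_n - x_1) & \cdots & (X_n - x_1)^p \end{pmatrix}_{n \times (p+1)}, \qquad Y = (Y_1, \ldots, Y_n)',$$

$$W = \operatorname{diag}\Bigl( K\Bigl( \frac{X_1 - x_0}{h} \Bigr), \ldots, K\Bigl( \frac{X_n - x_0}{h} \Bigr) \Bigr),$$

$$S_n = X'WX, \qquad (4)$$

$$T_n = X'WY. \qquad (5)$$

Then the solution of the least squares problem (3) is obtained as the
solution $\hat\beta$ of the linear system

$$S_n \hat\beta = T_n. \qquad (6)$$

The resulting local polynomial $\sum_{j=0}^{p} \hat\beta_j (x - x_1)^j$ is
independent of $x_1$. We estimate the $\nu$-th derivative of $r$ at the
point $x_0$ by

$$\hat r^{(\nu)}(x_0) = \nu! \sum_{k=\nu}^{p} \binom{k}{\nu} \hat\beta_k \, (x_0 - x_1)^{k-\nu}. \qquad (7)$$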

We assume that $X$ has full rank, i.e. that there are at least $p + 1$
points in the local smoothing interval. Then $\hat r^{(\nu)}(x_0)$ is
unique. Algorithmically this is achieved by increasing the bandwidth
locally until $p + 1$ points fall in the interval.
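The estimator defined by (3) to (7) can be sketched directly in a few lines. The following is a minimal illustration, not the authors' Fortran implementation: the function names `epanechnikov`, `solve` and `local_poly` are hypothetical, the normal equations are solved by plain Gaussian elimination rather than the Cholesky decomposition used later in the paper, and no bandwidth enlargement is performed when fewer than $p + 1$ points fall in the window.

```python
from math import comb, factorial


def epanechnikov(u):
    """Epanechnikov weight K(u) = (3/4)(1 - u^2) on [-1, 1], zero elsewhere."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def solve(a, b):
    """Solve a small dense system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = s / m[r][r]
    return x


def local_poly(xs, ys, x0, h, p, nu=0, x1=None, kernel=epanechnikov):
    """Estimate the nu-th derivative of r at x0 from a degree-p local fit centered at x1."""
    if x1 is None:
        x1 = x0
    s = [[0.0] * (p + 1) for _ in range(p + 1)]   # S_n = X'WX, equation (4)
    t = [0.0] * (p + 1)                           # T_n = X'WY, equation (5)
    for x, y in zip(xs, ys):
        w = kernel((x - x0) / h)
        if w == 0.0:
            continue
        powers = [(x - x1) ** j for j in range(2 * p + 1)]
        for j in range(p + 1):
            t[j] += w * powers[j] * y
            for k in range(p + 1):
                s[j][k] += w * powers[j + k]
    beta = solve(s, t)                            # linear system (6)
    # Derivative estimate (7).
    return factorial(nu) * sum(
        comb(k, nu) * beta[k] * (x0 - x1) ** (k - nu) for k in range(nu, p + 1)
    )
```

For exact data from a polynomial of degree at most $p$, the fit reproduces the polynomial and its derivatives up to round-off, and the result does not depend on the centering point `x1`, in line with the remark that the local polynomial is independent of $x_1$.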

Asymptotic properties are studied in Fan (1993), Ruppert & Wand (1992)
and Fan et al. (1993). In the latter it is shown that
$\hat r^{(\nu)}(x_0)$ is an asymptotically minimax efficient estimator
among all linear estimators. The Epanechnikov weight function
$K(x) = (3/4)(1 - x^2)_+$ is optimal for estimating the regression
function $r$ itself, as well as its $\nu$-th order derivative. Note that
$p - \nu > 0$ should be odd according to asymptotic theory, and that
usually $p - \nu$ is equal to 1 or at most 3 due to the local nature of
the approximation. The local polynomial method automatically adapts to
the boundary; the equivalent kernel is a boundary kernel as defined by
Gasser et al. (1985). This feature of the local polynomial method saves
extra computations at boundary points.

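3 Algorithms for Local Polynomial Fitting

3.1 The conventional algorithm

Using $x_1 = x_0$ we have

$$S_{n,j} = \sum_{i=1}^{n} K\Bigl( \frac{X_i - x_0}{h} \Bigr) (X_i - x_0)^j, \qquad (8)$$

$$T_{n,j} = \sum_{i=1}^{n} K\Bigl( \frac{X_i - x_0}{h} \Bigr) (X_i - x_0)^j \, Y_i, \qquad (9)$$

so that the elements of $S_n$ are $s_{jk} = S_{n,j+k-2}$ and the elements
of $T_n$ are $T_{n,0}, \ldots, T_{n,p}$. Thus, finite moments with respect
to the design points essentially determine the local polynomial fit.
This is also true for higher-dimensional designs.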

Once $S_n$ and $T_n$ have been computed, the local polynomial fit is
obtained by solving the linear system (6). The computational effort of
this step is independent of $n$. (We will approach the problem of
solving the normal equations later on, using the Cholesky
decomposition.) Hence a fast algorithm relies on the fast computation
of $S_n$ and $T_n$ over the entire output grid.

Usually, the output grid will consist of $n$ points, like the input
grid (e.g. for cross-validation), or of a fraction of $n$ if $n$ is
large, or of a multiple of $n$ if $n$ is small (e.g. for graphical
representation). If the number of points in the output grid is thus
$m = O(n)$, then a conventional implementation of (6) will require
$O(n^2 h)$ operations, based on weight functions with compact support.
For standard regression estimation the optimal $h$ is of order
$n^{-1/5}$, leading to $O(n^{9/5})$ operations for a curve fit. However,
for small bandwidths $h = O(n^{-1})$ (when the estimator is close to
interpolation) the conventional implementation approaches $O(n)$
operations.

Now we derive fast algorithms, based on a polynomial weight function

$$K(x) = \sum_{k=0}^{a} c_k x^k \, 1_{[-1,1]}(x), \qquad (10)$$

comprising in particular the optimal Epanechnikov weights
$K(x) = (3/4)(1 - x^2)_+$ and the minimum variance (uniform) weights.
For simplicity we will present the algorithms only for $S_n$, since the
computation of $T_n$ is then straightforward. Moreover, we present the
case of a constant or global bandwidth, but in fact our algorithms work
for local bandwidths $h = h(x_0)$ as well.

3.2 A "naive" fast algorithm: the idea of updating

Using the binomial formula in (8) and rearranging summation we get

$$
\begin{aligned}
S_{n,j} &= \sum_{i=1}^{n} \Bigl( \sum_{k=0}^{a} c_k \Bigl( \frac{X_i - x_0}{h} \Bigr)^{k} \Bigr) 1_{[x_0-h,\,x_0+h]}(X_i) \, (X_i - x_0)^j \\
&= \sum_{k=0}^{a} c_k h^{-k} \sum_{i=1}^{n} (X_i - x_0)^{j+k} \, 1_{[x_0-h,\,x_0+h]}(X_i) \qquad (11) \\
&= \sum_{k=0}^{a} c_k h^{-k} \sum_{\ell=0}^{j+k} \binom{j+k}{\ell} (-x_0)^{j+k-\ell} \Bigl\{ \sum_{i=1}^{n} X_i^{\ell} \, 1_{[x_0-h,\,x_0+h]}(X_i) \Bigr\}. \qquad (12)
\end{aligned}
$$

Given the value of $S_{n,j}$ at $x_0$, we can save a lot of computations
by reusing the inner sums (in braces) over $i$ when calculating
$S_{n,j}$ at the next output grid point, $x_{01}$ say. From the inner sum
we subtract the terms that are not in $[x_{01} - h, x_{01} + h]$, and add
those terms which are in this interval but do not belong to
$[x_0 - h, x_0 + h]$. This results in a fast $O(n)$ algorithm, which is
reminiscent of the old add/subtract box car smoothing (compare e.g.
Eddy, 1980). Independent of $h$ and $j$, one has to calculate the terms
$X_i^{\ell}$, $i = 1, \ldots, n$, $0 \le \ell \le 2p + a$, only once.
However, this algorithm is numerically unstable. The main source of
instability is the expansion of the term $(X_i - x_0)^{j+k}$. The
add/subtract idea then leads to an accumulation of numerical errors.
The problem is comparable to the well-known instability of the textbook
one-pass algorithm for estimating a variance.
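The add/subtract updating of the inner sums can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function names are hypothetical, the design points are assumed sorted in ascending order, and the output grid is assumed to be ascending so that the window only moves to the right. The second function recomputes the same sums from scratch and serves only as a reference.

```python
def window_power_sums(xs, grid, h, max_power):
    """Sums of X_i^l over [x0 - h, x0 + h] for each x0 in an ascending grid.

    The sums are updated by adding points that enter the window and
    subtracting points that leave it; xs must be sorted ascending.
    Returns one list [sum_0, ..., sum_max_power] per grid point.
    """
    # The powers X_i^l are computed only once, independent of h.
    powers = [[x ** l for l in range(max_power + 1)] for x in xs]
    sums = [0.0] * (max_power + 1)
    lo = hi = 0  # the current window is xs[lo:hi]
    result = []
    for x0 in grid:
        while hi < len(xs) and xs[hi] <= x0 + h:   # add points entering on the right
            for l in range(max_power + 1):
                sums[l] += powers[hi][l]
            hi += 1
        while lo < hi and xs[lo] < x0 - h:         # subtract points leaving on the left
            for l in range(max_power + 1):
                sums[l] -= powers[lo][l]
            lo += 1
        result.append(list(sums))
    return result


def window_power_sums_direct(xs, x0, h, max_power):
    """Reference: recompute the same sums from scratch for one output point."""
    window = [x for x in xs if x0 - h <= x <= x0 + h]
    return [sum(x ** l for x in window) for l in range(max_power + 1)]
```

Each design point is added once and subtracted at most once, which is where the $O(n)$ count comes from. The sketch deliberately keeps the uncentered powers $X_i^{\ell}$, so it inherits the instability discussed above: for design points far from the origin the running sums lose significant digits.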

One way out is to use centered quantities $(X_i - x_0)^{\ell}$ only, or
quantities centered by $\bar X_0$, the mean of the design points in the
interval $[x_0 - h, x_0 + h]$ (as is common in polynomial regression and
done in this paper).

If we move towards the boundary, an increasing numerical instability is
expected and observed. There the number of points in
$[x_0 - h, x_0 + h]$ typically decreases, which leads to smaller
quantities $S_{n,j}$, and hence to increasing relative numerical errors.
Also, the weights at the boundary become larger by an order of
magnitude. This difficulty is dealt with by running from both ends to
the middle of the estimation interval.

Our goals are the following: We would like to have a fast and stable
algorithm over the entire domain of bandwidths, starting with an $h$
containing the minimal number of design points, which is $p + 1$, and
going up to the maximal $h$. Numerical stability should be guaranteed
for the Epanechnikov and the uniform weight function, i.e. $a = 2$ and
$a = 0$. Of interest are the regression function itself ($\nu = 0$) and
the first and second derivatives ($\nu = 1, 2$), whereas $\nu = 3, 4$
might be needed for estimating smooth functionals only, e.g. for
selecting optimal bandwidths (Gasser et al., 1991). Usually, we are
satisfied to use a polynomial of order $p = \nu + 1$, but for
$\nu = 0, 1, 2$ the choice of a higher order $p$ may also be of interest.

The above algorithmic steps will not be sufficient to reach these
goals. The most important additional techniques consist of detecting
ill-conditioned cases for small bandwidths and automatically restarting
the updating procedure, based on properties of the computed matrix
$S_n$ (see section 3.4 below).

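3.3 A fast and stable algorithm: the idea of centering while moving

To avoid the numerical instability of the naive fast algorithm based on
(12), it is necessary to use centered quantities only. In Seifert et
al. (1994) two stable algorithms using $x_1 = x_0$ and

$$x_1 = \bar X_0 = \frac{\sum_{i=1}^{n} X_i \, 1_{[x_0-h,\,x_0+h]}(X_i)}{\sum_{i=1}^{n} 1_{[x_0-h,\,x_0+h]}(X_i)}$$

were presented. Here, we present a fast algorithm using
$x_1 = \bar X_0$, the mean of the design points to be used for estimation
of $r^{(\nu)}(x_0)$. Then

$$
\begin{aligned}
S_{n,j} &= \sum_{i=1}^{n} \Bigl( \sum_{k=0}^{a} c_k \Bigl( \frac{X_i - x_0}{h} \Bigr)^{k} \Bigr) (X_i - \bar X_0)^j \, 1_{[x_0-h,\,x_0+h]}(X_i) \\
&= \sum_{i=1}^{n} \sum_{k=0}^{a} c_k h^{-k} \Bigl( \sum_{\ell=0}^{k} \binom{k}{\ell} (\bar X_0 - x_0)^{k-\ell} (X_i - \bar X_0)^{\ell} \Bigr) (X_i - \bar X_0)^j \, 1_{[x_0-h,\,x_0+h]}(X_i) \\
&= \sum_{k=0}^{a} c_k h^{-k} \sum_{\ell=0}^{k} \binom{k}{\ell} (\bar X_0 - x_0)^{k-\ell} \Bigl\{ \sum_{i=1}^{n} (X_i - \bar X_0)^{j+\ell} \, 1_{[x_0-h,\,x_0+h]}(X_i) \Bigr\}. \qquad (13)
\end{aligned}
$$

This leads to a representation of local polynomials in central (sample)
moments (in braces)

$$m_j = \sum_{i=1}^{n} (X_i - \bar X_0)^j \, 1_{[x_0-h,\,x_0+h]}(X_i). \qquad (14)$$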
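The representation of $S_{n,j}$ through central moments rests on the binomial expansion of $(X_i - x_0)^k$ around the window mean. The following sketch checks this identity numerically for a polynomial weight function given by its coefficients; the function names and the coefficient list `c` (for instance `[0.75, 0.0, -0.75]` for the Epanechnikov weights) are illustrative assumptions, not part of the original paper.

```python
from math import comb


def s_direct(xs, x0, h, j, c):
    """S_{n,j}: weighted sum of (X_i - window mean)^j with K(u) = sum_k c[k] u^k on [-1, 1]."""
    win = [x for x in xs if abs(x - x0) <= h]
    xbar = sum(win) / len(win)
    return sum(
        sum(ck * ((x - x0) / h) ** k for k, ck in enumerate(c)) * (x - xbar) ** j
        for x in win
    )


def s_from_central_moments(xs, x0, h, j, c):
    """The same quantity assembled from central moments of the window only."""
    win = [x for x in xs if abs(x - x0) <= h]
    xbar = sum(win) / len(win)
    a = len(c) - 1
    # Central moments m_0, ..., m_{j+a} of the points in the window.
    m = [sum((x - xbar) ** l for x in win) for l in range(j + a + 1)]
    return sum(
        c[k] * h ** (-k)
        * sum(comb(k, l) * (xbar - x0) ** (k - l) * m[j + l] for l in range(k + 1))
        for k in range(a + 1)
    )
```

Once the central moments of the window are available, $S_{n,j}$ costs only a fixed number of operations that does not depend on the number of points in the window, which is why the remaining task is to update the moments efficiently.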


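What remains is to find a fast and stable updating formula for $m_j$.
For this purpose we generalized a formula for pooling estimates of
variance ($j = 2$) by Chan, Golub & LeVeque (1983). Their formula is
known to be fast and stable. It was introduced independently by Spicer
(1972) for the computation of central moments ($j = 1$ to $4$). Suppose
we have two distinct subsamples $X_{\ell 1}, \ldots, X_{\ell n_\ell}$ with
means $\bar X_\ell$ and central moments $m_{j,\ell}$, $\ell = 1, 2$. Denote
by $\bar X$ and $m_j$ the mean and central moments of the union of both
subsamples. Then the "add" part of updating becomes

$$\bar X = \bar X_1 + n_2 (\bar X_2 - \bar X_1) / (n_1 + n_2) \qquad (15)$$

and

$$
\begin{aligned}
m_j &= \sum_{i=1}^{n_1} (X_{1i} - \bar X)^j + \sum_{i=1}^{n_2} (X_{2i} - \bar X)^j \\
&= \sum_{k=0}^{j} \binom{j}{k} (\bar X_1 - \bar X)^{j-k} m_{k,1} + \sum_{k=0}^{j} \binom{j}{k} (\bar X_2 - \bar X)^{j-k} m_{k,2}. \qquad (16)
\end{aligned}
$$

Note that $m_{1,\ell} = 0$ and $m_{0,\ell} = n_\ell$. Denote

$$d = \bar X_1 - \bar X. \qquad (17)$$

We get $\bar X_2 - \bar X = -n_1 d / n_2$ and, substituting into (16), for
$j \ge 2$

$$m_j = \sum_{k=2}^{j} \binom{j}{k} d^{j-k} \Bigl( m_{k,1} + \Bigl( -\frac{n_1}{n_2} \Bigr)^{j-k} m_{k,2} \Bigr) + n_1 d^j \Bigl( 1 + (-1)^j \Bigl( \frac{n_1}{n_2} \Bigr)^{j-1} \Bigr). \qquad (18)$$

A subsample is removed ("subtract" part of updating) by

$$\bar X_1 = \bar X + n_2 (\bar X - \bar X_2) / n_1 \qquad (19)$$

and

$$
\begin{aligned}
m_{j,1} &= \sum_{i=1}^{n_1} (X_{1i} - \bar X_1)^j + \sum_{i=1}^{n_2} (X_{2i} - \bar X_1)^j - \sum_{i=1}^{n_2} (X_{2i} - \bar X_1)^j \\
&= \sum_{k=0}^{j} \binom{j}{k} (\bar X - \bar X_1)^{j-k} m_k - \sum_{k=0}^{j} \binom{j}{k} (\bar X_2 - \bar X_1)^{j-k} m_{k,2}.
\end{aligned}
$$

Using $\bar X - \bar X_1 = -d$ and
$\bar X_2 - \bar X_1 = -(n_1 + n_2) d / n_2$, as before, for $j \ge 2$

$$m_{j,1} = \sum_{k=2}^{j} \binom{j}{k} (-d)^{j-k} \Bigl( m_k - \Bigl( \frac{n_1 + n_2}{n_2} \Bigr)^{j-k} m_{k,2} \Bigr) + (n_1 + n_2) (-d)^j \Bigl( 1 - \Bigl( \frac{n_1 + n_2}{n_2} \Bigr)^{j-1} \Bigr). \qquad (20)$$

Updating the central moments (14) using these formulae results in an
overall $O(n)$ algorithm.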
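The add and subtract steps for central moments can be sketched as below. This is a minimal illustration under stated assumptions: a subsample is represented by a hypothetical tuple `(n, mean, [m_0, ..., m_J])`, the function names are not from the paper, and the moments are combined through the binomial expansions around the new mean rather than through the simplified closed forms.

```python
from math import comb


def central_moments(xs, jmax):
    """Direct two-pass computation: returns (n, mean, [m_0, ..., m_jmax])."""
    n = len(xs)
    mean = sum(xs) / n
    return n, mean, [sum((x - mean) ** j for x in xs) for j in range(jmax + 1)]


def add_subsample(s1, s2):
    """Pool two subsamples: mean as in (15), moments by the expansion (16)."""
    n1, mean1, m1 = s1
    n2, mean2, m2 = s2
    mean = mean1 + n2 * (mean2 - mean1) / (n1 + n2)
    d1, d2 = mean1 - mean, mean2 - mean
    m = [
        sum(comb(j, k) * (d1 ** (j - k) * m1[k] + d2 ** (j - k) * m2[k])
            for k in range(j + 1))
        for j in range(len(m1))
    ]
    return n1 + n2, mean, m


def remove_subsample(s, s2):
    """Remove subsample 2 from the pooled sample: mean as in (19), moments by subtraction."""
    n, mean, m = s
    n2, mean2, m2 = s2
    n1 = n - n2
    mean1 = mean + n2 * (mean - mean2) / n1
    e, e2 = mean - mean1, mean2 - mean1
    m1 = [
        sum(comb(j, k) * (e ** (j - k) * m[k] - e2 ** (j - k) * m2[k])
            for k in range(j + 1))
        for j in range(len(m))
    ]
    return n1, mean1, m1
```

Pooling two subsamples and then removing one of them again recovers the moments of the other up to round-off. In a moving window, the points entering and leaving at each step play the role of the second subsample.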

Figure 1 shows the numerical error of the resulting fast algorithm,
compared with the conventional one. Note that the fast algorithm starts
at both ends and runs to the middle of the interval. It can be seen
that centering at $x_1 = \bar X_0$ may have numerical advantages over
centering at $x_1 = x_0$, especially at the boundary.

Figure 1: Numerical error of the fast algorithm (without restarts),
compared with the conventional one, for the Epanechnikov weight
function, $p = 1$, $n = 1000$ random uniform design points and
$h = 0.25$. Solid line is fast, dots are conventional algorithm.

As can be seen, round-off errors may accumulate, and restarting will be
used to stabilize the updating procedure. The loss in computational
speed is reduced by "exact binning": Consider a hypothetical partition
of the whole sample into subsamples $X_{\ell 1}, \ldots, X_{\ell n_0}$ of
length $n_0$ (bins). If the algorithm is restarted at $x_0$, say, and
$h$ is large enough, the points in the interval $[x_0 - h, x_0 + h]$ are
divided into a left part with fewer than $n_0$ observations, a central
part consisting of subsamples of length $n_0$ (complete bins), and a
remaining right part. Once a partition into bins has been chosen, the
central moments of any bin are independent of the bandwidth $h$ and the
output point $x_0$. Hence storage of moments leads to savings in
computation time: once the central moments of such a bin have been
computed, they are stored and can be reused when restarting at another
output point, with another bandwidth, or with a new (smaller)
polynomial order $p$, since they are independent of these quantities.
This option of binning is particularly attractive in the case of
iteration, as e.g. for plug-in bandwidth selection (Gasser et al.,
1991), or when the same design occurs repeatedly. The following
argument is helpful when choosing a bin length $n_0$. If the moments of
the central parts are already available, the computation of $m_j$
reduces from $O(nh)$ to $O(n h n_0^{-1}) + O(n_0)$ operations.
Consequently $n_0$ should be $O((nh)^{1/2})$. In the usual binning only
the first moments are retained, which leads to an approximation error
there.

The add part (15) and (18) of the updating formula allows the
construction of a moderately fast but highly stable algorithm:
Computation of the central moments of bins of length $n_0$ needs $O(n)$
steps. Restarting at every output point results in
$O(m n h n_0^{-1}) + O(m n_0)$ operations. For $m = O(n)$,
$h = O(n^{-1/5})$, and taking an optimal $n_0 = O(n^{2/5})$, we get an
algorithm with $O(n^{7/5})$ operations, compared to $O(n^{9/5})$ for the
conventional one. The computation of central moments using exact
binning is more stable than the standard two-pass algorithm, so we can
expect an algorithm that is not only faster but also more stable than
the conventional one. Like the conventional one, this algorithm
approaches $O(n)$ operations for small bandwidths $h = O(n^{-1})$.
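The bin length and the operation counts quoted above follow from a short balancing argument. The worked derivation below is a sketch in which constant factors are ignored; the cost function $c(n_0)$ is introduced here only for illustration and does not appear in the original.

```latex
% Cost of one restart when the moments of complete bins are already stored:
% about nh / n_0 stored bins to pool, plus O(n_0) points in the partial end bins.
c(n_0) = \frac{nh}{n_0} + n_0, \qquad
c'(n_0) = -\frac{nh}{n_0^{2}} + 1 = 0
\;\Longrightarrow\; n_0 = (nh)^{1/2}.

% Restarting at each of m = O(n) output points with h = O(n^{-1/5}):
n_0 = \bigl( n \cdot n^{-1/5} \bigr)^{1/2} = n^{2/5}, \qquad
m \, c(n_0) = O\bigl( n \cdot 2\, n^{2/5} \bigr) = O(n^{7/5}),
\qquad \text{versus} \qquad
O(n^{2} h) = O(n^{9/5}).
```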

3.4  Solution  of  the  normal  equations  and 
automatic  restart 

Cholesky  decomposition  was  used  to  solve  the  normal 
equations  (6)  for  the  following  reasons: 

•  The  matrices  of  coefficients  Sn  are  positive  definite. 

•  The  Cholesky  decomposition  is  fast. 

•  The  numerical  stability  is  scale  invariant  and  proved 
to  be  good  for  the  cases  of  interest.  This  fact  led 
to  the  decision  not  to  use  orthogonal  polynomials, 
which  would  decrease  computational  speed. 

Also  the  Cholesky  decomposition  can  be  used  to  solve 
the  following  two  numerical  problems: 

•  control  the  numerical  condition  of  the  normal  equa¬ 
tions, 

•  control  the  accuracy  of  the  updating  procedure  for 
computing  the  normal  equations  by  appropriate 
restarting. 

For  this  we  need  some  theoretical  analysis  of  Cholesky 
decomposition. 

Cholesky decomposition: The decomposition is of the form

$$S_n = L D L'.$$

$L = ((\ell_{jk}))$ is a lower triangular matrix with diagonal elements
$\ell_{jj} = 1$. $D = \operatorname{diag}(d_j)$ is the diagonal matrix of
Cholesky factors. The normal equations are then solved step by step.
The well-known formulae for the decomposition use only the four
fundamental rules of arithmetic:

$$d_j = s_{jj} - \sum_{k<j} \ell_{jk}^2 \, d_k, \qquad (21)$$

$$\ell_{jk} = \Bigl( s_{jk} - \sum_{i<k} \ell_{ji} \ell_{ki} \, d_i \Bigr) \Big/ d_k. \qquad (22)$$

Cholesky factors $d_j$ should be sufficiently far from zero compared to
$s_{jj}$ to avoid the loss of significant digits in (21). The ratios
$d_j / s_{jj}$ are scale invariant. It will be shown that they are hardly
affected by the bandwidth $h$ and by the sample size $n$, whereas the
local shape of the design density $f$ may matter. Due to its
sensitivity, the last ratio $d_{p+1} / s_{p+1,p+1}$ is used to assess
stability and is henceforth called the "stability factor". Note that a
scale transformation, e.g. to $s_{jj} = 1$, does not improve the
numerical stability of the solution.

Figure 2: Stability factor of $S_n$ in (4) for the Epanechnikov weight
function, $p = 3$ and $n = 1000$ equidistant (above) and uniformly
distributed (below) design points, depending on $x_0$ and $h$.

Under common assumptions, from (8) we get an asymptotic representation

$$
\begin{aligned}
s_{jk} &= S_{n,j+k-2} \\
&= n \int (u - x_0)^{j+k-2} \, K\Bigl( \frac{u - x_0}{h} \Bigr) f(u) \, du \; (1 + o(1)) \\
&= n h^{j+k-1} f(x_0) \int z^{j+k-2} K(z) \, dz \; (1 + o(1)). \qquad (23)
\end{aligned}
$$

Singularity: Formula (23) leads to theoretical values of the Cholesky
factors $d_j$ and the stability factor $d_{p+1} / s_{p+1,p+1}$ of $S_n$.
For finite samples, the term $f(x_0)$ in (23) has to be replaced by a
value which only depends on the shape of the design density in
$[x_0 - h, x_0 + h]$. The Cholesky factors are of order
$d_j = O(n h^{2j-1})$, as is $s_{jj}$. Consequently, if the number of
points in the local smoothing interval is not too small, the stability
factor of $S_n$ is close to a value which does not depend on $n$ and
$h$, but only on the weight function used.
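A theoretical stability factor of this kind can be computed from the moments of the weight function alone. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the decomposition is the square-root-free form with unit lower triangular factor described in this section, and the limiting matrix is taken to be the matrix of kernel moments $\int z^{j+k} K(z)\,dz$, $j, k = 0, \ldots, p$, i.e. the leading term of the asymptotic representation with the scale factors in $n$, $h$ and $f(x_0)$ dropped (the ratio is scale invariant).

```python
def ldl(s):
    """Return (L, d) with S = L diag(d) L', L unit lower triangular."""
    n = len(s)
    low = [[1.0 if j == k else 0.0 for k in range(n)] for j in range(n)]
    d = [0.0] * n
    for j in range(n):
        for k in range(j):
            # Off-diagonal element of row j, using factors already computed.
            low[j][k] = (
                s[j][k] - sum(low[j][i] * low[k][i] * d[i] for i in range(k))
            ) / d[k]
        # Cholesky factor d_j.
        d[j] = s[j][j] - sum(low[j][k] ** 2 * d[k] for k in range(j))
    return low, d


def stability_factor(s):
    """Last Cholesky factor relative to the last diagonal element of s."""
    _, d = ldl(s)
    return d[-1] / s[-1][-1]


def epanechnikov_moment(m):
    """Integral of z^m (3/4)(1 - z^2) over [-1, 1]."""
    return 0.0 if m % 2 else 0.75 * (2.0 / (m + 1) - 2.0 / (m + 3))


def limiting_matrix(p, moment):
    """Matrix of kernel moments mu_{j+k}, j, k = 0..p."""
    return [[moment(j + k) for k in range(p + 1)] for j in range(p + 1)]
```

For the Epanechnikov weights and $p = 3$, `stability_factor(limiting_matrix(3, epanechnikov_moment))` evaluates to $8/35 \approx 0.229$.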

Figure 2 shows the stability factor of $S_n$ in (4). As will be
explained below, for minimal bandwidth the polynomial weight function
is replaced by the uniform one. The figures show a plateau which is
close to the theoretical value 0.229 even for small bandwidths. The
approximation is extremely good for the fixed design. At the boundaries
(increasing with $h$) the stability factor changes.

Figure 3: Stability factor (above) of $S_n$ in (4) for $h = 0.001$, the
uniform weight function, $p = 3$ and $n = 100$ uniformly distributed
design points. Solid line is stability factor for sing $= 10^{-2}$,
dashes are stability factor for sing $= 10^{-30}$. Below are
corresponding numerical errors.

As to be expected a priori, and as shown by the figures, singularity is
only a problem for small bandwidths. Theoretically, $p + 1$ points
(already required in section 2) are sufficient to obtain a stable
solution. However, in practice numerical problems may arise, basically
for two reasons. The first is that the polynomial weight function
decreases the influence of points close to $x_0 \pm h$. As a first step
we switch to uniform weights when there are only $p + 1$ points in the
interval. Then $X$ and $W$ are nonsingular $(p+1) \times (p+1)$
matrices, and from (6) the solution $\hat\beta = X^{-1} Y$ is
independent of the weight function. Thus, the estimator is not changed,
but its computation is more stable. A second reason for stability
problems is that in the random design case design points may lie close
together. The independence of the stability factor from $n$ and $h$
makes it possible to control the stability of the normal equations.
$S_n$ is defined to be singular if

$$d_{p+1} / s_{p+1,p+1} < \text{sing} \times \text{``theoretical value''},$$

where the theoretical value is derived from (23) and "sing" can be
chosen by the user. However, the size of the parameter is not critical.
After careful evaluation the standard value was set to sing $= 0.01$.
The theoretical values used depend only on $p$ and the weight function
and are given in advance. If $S_n$ is singular, the local smoothing
interval is enlarged by one point.

Figure 3 shows this modification when applied to the stable algorithm
described in section 3.3. Using sing $= 0.01$, only a few local
smoothing intervals are changed, but the algorithm is much more stable.

Stability of updating: As noted above, the updating procedure for
computing moments in the matrix $S_n$ may lead to substantial round-off
errors. The aim is to detect such departures and to restart the
updating procedure. The computation of the stability factor of $S_n$
uses all moments $m_j$ in a complex manner, and hence allows the
possibility of controlling the numerical stability of the updating
algorithm.

Figure 4: Stability factor (above) of $S_n$ in (4) for the Epanechnikov
weight function, $p = 5$ and $n = 1000$ equidistant design points. Solid
line is numerical stability factor without restart; dotted line is
stability factor with restarts, graphically indistinguishable from the
true stability factor. Below are numerical errors for $\nu = 4$ without
(solid line) and with (dots) restart.

Figure 4 (above) shows the numerical stability factor of $S_n$ in (4)
without and with restarts. Note that the algorithm starts at both ends
and runs to the middle of the interval. Data are generated for a
polynomial of order 5, so that a straight line is estimated for
$\nu = 4$. The figure illustrates that the stability factor can serve as
a device for detecting the accumulation of round-off errors in $S_n$.

We use the stability factor at the last restart as benchmark, and
update as long as

$$\frac{1}{\text{stab}} > \frac{\text{``computed stability factor''}}{\text{``stability factor at last restart''}} > \text{stab}.$$

The success of this restart rule using stab $= 0.95$ is demonstrated in
figure 4 (below). There is only 1 additional restart, but numerical
stability is greatly improved.
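The two controls of this section, the singularity check governed by "sing" and the restart rule governed by "stab", can be sketched as two small predicates. This is only an illustration: the function names are hypothetical, and the restart rule is coded under the assumption that the criterion is the two-sided band in which the ratio of the computed stability factor to the one at the last restart stays between stab and 1/stab.

```python
def is_singular(stability_factor, theoretical_value, sing=0.01):
    """True if the stability factor falls below sing times its theoretical value."""
    return stability_factor < sing * theoretical_value


def keep_updating(computed_factor, factor_at_last_restart, stab=0.95):
    """True while the ratio of stability factors stays inside (stab, 1/stab).

    The two-sided band is an assumed reading of the restart criterion.
    """
    ratio = computed_factor / factor_at_last_restart
    return stab < ratio < 1.0 / stab
```

In a driver loop one would, under these assumptions, enlarge the local smoothing interval by one point while `is_singular` holds, and recompute the moments from scratch as soon as `keep_updating` returns false.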

4  Evaluation  of  algorithms 

Two  aims  are  pursued  in  this  section: 

•  to  check  and  compare  numerical  stability, 

•  to  evaluate  computational  speed. 

The  scope  of  the  evaluation  is  as  follows: 

•  The  range  of  bandwidths  goes  from  the  minimal  to 
the  maximal  one. 

•  Interest is focussed on derivatives of order $\nu = 0, 1, 2$, while
$\nu = 3, 4$ are of interest for estimating smooth functionals of
$r^{(\nu)}$.

•  Polynomial orders $p = \nu + 1$ are of prime interest, and
$p = \nu + 3$ is still of sufficient interest to warrant full
evaluation. Higher order polynomials around $p = 10$ illustrate the
range of applicability.

4.1  Realization  of  algorithms 

The  following  three  algorithms  are  considered: 

conventional: the conventional $O(n^{9/5})$ algorithm based on (11). In
fact the conventional algorithm should use (8), but for polynomial
weight functions (11) is only a slight modification.

fast: the fast $O(n)$ algorithm derived in section 3.3, based on
updating the normal equations, exact binning, centering while moving,
controlling ill-conditioned situations for small bandwidths, and
data-tuned restarting of the updating procedure.

stable: the superstable $O(n^{7/5})$ algorithm, which works as "fast"
but restarts at every output point (no updating).

The algorithms were realized in Fortran 77 with double precision on a
Sun IPX workstation. They have additional common features:

•  To reduce numerical boundary problems, the algorithms start at both
ends and run to the middle of the estimation interval.

•  The algorithms use Cholesky decomposition with parameters sing and
stab described in section 3.4.

•  When solving the normal equations, the coefficient matrix $S_n$ is
assumed to be nonsingular. In theory this is fulfilled if the number of
observations in the local smoothing interval $[x_0 - h, x_0 + h]$ is at
least $p + 1$. Consequently, in case of a numerically singular matrix
(see section 3.4), the local bandwidth is increased.

•  If the number of observations in the local smoothing interval is
minimal, uniform weights are used. If this number is $p + 1$, this
gives the same estimator as polynomial weights. However, the unweighted
estimator is numerically more stable.

•  Updating saves computing time but possibly costs numerical
stability. We should restart if the situation is extremely unstable or
if an update does not save time. Hence a restart is forced if the
number of observations is minimal, or if an update would remove more
than one third of the observations used.

Figure 5: Elapsed time (in seconds) of different algorithms for
$\nu = 0$ and $p = 1$ depending on sample size $n$. Solid line is
conventional, dashes are stable, and dots are fast algorithm.

4.2  The  design  of  the  case  study 

The designs considered were fixed and random on $[0, 1]$ with uniform
($f(x) = 1$), linear ($f(x) = 2x$) and truncated normal
($f(x) = \varphi(2x - 1) / (2\Phi(1) - 1)$) densities. The number of
observations runs from $n = 10$ to $10000$, focussing evaluations on
$n = 1000$. Regression functions are polynomials, thereby avoiding
problems with bias. Exact observations and observations with normal
errors were used. The Epanechnikov weight function was chosen because
of its optimality.

4.3  Computational  speed 

Figure 5 compares elapsed time of the algorithms as a function of
sample size $n$ for random uniform design on $[0, 1]$, an equidistant
output grid with $m = n$ points, and bandwidths $h(n) = 0.2\, n^{-1/5}$.
The stable and fast algorithms used bins containing about
$(2nh)^{1/2}$ observations. The fast algorithm is to a good
approximation $O(n)$. From $n = 1000$ to $10000$ elapsed time increased
by a factor of 14, slightly more than the factor of 10 ideally
expected. These results were confirmed for other situations.

Figure 6: Elapsed time (in seconds) of different algorithms for
$\nu = 0$ and $p = 1$ depending on bandwidth (on logarithmic scale).
Solid line is conventional, dashes are stable, and dots are fast
algorithm.

A further point of interest is computational speed with respect to
bandwidth. For fixed sample size, elapsed time of the conventional
algorithm is roughly proportional to $h$. The speed of the fast
algorithms is expected to be approximately independent of $h$.

Figure 6 illustrates how elapsed time depends on $h$ for equidistant
design and output grid on $[0, 1]$ with $m = n = 1000$ points. The
stable and fast algorithms used bins of the same length as in figure 5,
i.e. they contained 10 observations. In fact, elapsed time of the fast
algorithm is almost constant. For graphical reasons the time axis was
cut. The conventional algorithm needed 6.9 seconds for $h = 0.5$,
compared with 0.1 seconds for the fast and superstable algorithms.

The elapsed time of the fast algorithm was compared with that of the
fast Fourier transform (Rabiner & Gold, 1975, p. 367). Evidently, the
FFT is in general not applicable to estimating the regression function
$r$ (or its derivatives) in model (1), due to inherent restrictions
with respect to design, boundary problems etc. Owing to its well-known
good performance in terms of speed, it can nevertheless be taken as a
benchmark in this respect. In the case with $n = 2^k$ equidistant design
points, $\nu = 0$ and $m = n$, which is ideal for the FFT, our fast
algorithm needed only 70 % more time.

The attractive computational efficiency of updating algorithms has also
been confirmed by Fan & Marron (1993) in a comparison with existing
fast algorithms.

Table 1: Maximal relative numerical errors rdist of algorithms over $h$
for exact data, $m = n = 1000$ and $p = \nu + 1$, using sing $= 10^{-2}$
and stab $= 0.99$. Entries are $\max_h \text{rdist}(h)$.

design            ν    convent     stable      fast
fixed uniform     0    0.19E-13    0.19E-14    0.27E-11
                  1    0.24E-11    0.11E-11    0.18E-07
                  2    0.23E-08    0.12E-09    0.28E-04
                  3    0.23E-05    0.52E-06    0.23E-02
                  4    0.80E-02    0.23E-04    0.32
random uniform    0    0.19E-12    0.33E-14    0.30E-13
                  1    0.55E-10    0.11E-09    0.11E-09
                  2    0.57E-06    0.13E-07    0.13E-07
                  3    0.57E-02    0.15E-05    0.15E-05
                  4    0.31        0.71E-03    0.71E-03
fixed linear      0    0.55E-13    0.43E-14    0.17E-10
                  1    0.13E-11    0.57E-12    0.71E-08
                  2    0.33E-08    0.61E-09    0.28E-08
                  3    0.10        0.44E-06    0.80E-06
                  4    1.9         0.14E-03    0.62E-03
random linear     0    0.34E-13    0.13E-13    0.33E-13
                  1    0.40E-10    0.63E-10    0.63E-10
                  2    0.63E-01    0.83E-08    0.83E-08
                  3    0.10        0.17E-05    0.17E-05
                  4    2.0         0.27E-03    0.27E-03
fixed normal      0    0.41E-13    0.27E-14    0.32E-11
                  1    0.18E-11    0.52E-12    0.31E-10
                  2    0.29E-08    0.16E-09    0.69E-08
                  3    0.27E-05    0.21E-06    0.53E-06
                  4    0.97E-02    0.16E-04    0.19E-04
random normal     0    0.13E-12    0.33E-14    0.14E-13
                  1    0.71E-10    0.10E-09    0.10E-09
                  2    0.22E-01    0.32E-07    0.32E-07
                  3    0.31E-01    0.46E-05    0.46E-05
                  4    2.0         0.41E-03    0.41E-03

Table 2: Maximal mean numerical errors mdist of algorithms over $h$ for
exact data, $m = n = 1000$ and $p = \nu + 1$, using sing $= 10^{-2}$ and
stab $= 0.99$. Entries are $\max_h \text{mdist}(h)$.

design            ν    convent     stable      fast
fixed uniform     0    0.14E-14    0.81E-15    0.40E-12
                  1    0.26E-13    0.13E-13    0.15E-08
                  2    0.80E-11    0.72E-11    0.77E-06
                  3    0.65E-08    0.17E-08    0.56E-04
                  4    0.23E-04    0.26E-06    0.53E-02
random uniform    0    0.19E-13    0.13E-14    0.39E-14
                  1    0.75E-12    0.97E-12    0.97E-12
                  2    0.14E-08    0.14E-09    0.14E-09
                  3    0.94E-05    0.28E-07    0.28E-07
                  4    0.39E-03    0.42E-05    0.42E-05
fixed linear      0    0.37E-14    0.91E-15    0.20E-11
                  1    0.52E-13    0.36E-13    0.27E-09
                  2    0.35E-10    0.27E-10    0.44E-10
                  3    0.79E-03    0.11E-07    0.11E-07
                  4    0.33E-02    0.17E-05    0.17E-05
random linear     0    0.31E-14    0.80E-15    0.44E-14
                  1    0.64E-12    0.75E-12    0.76E-12
                  2    0.27E-03    0.16E-09    0.16E-09
                  3    0.13E-02    0.29E-07    0.29E-07
                  4    0.11E-01    0.33E-05    0.33E-05
fixed normal      0    0.26E-14    0.76E-15    0.86E-12
                  1    0.39E-13    0.39E-13    0.61E-11
                  2    0.16E-10    0.95E-11    0.10E-08
                  3    0.75E-08    0.31E-08    0.31E-08
                  4    0.24E-04    0.37E-06    0.37E-06
random normal     0    0.31E-14    0.11E-14    0.19E-14
                  1    0.67E-12    0.11E-11    0.12E-11
                  2    0.53E-04    0.15E-09    0.15E-09
                  3    0.59E-04    0.27E-07    0.27E-07
                  4    0.21E-02    0.40E-05    0.40E-05

4.4 Numerical stability

To check numerical stability the relative distance in sup-norm

$$\text{rdist} = \frac{\max_{j=1,\ldots,m} \bigl| \hat r^{(\nu)}(x_j) - \tilde r^{(\nu)}(x_j) \bigr|}{\frac{1}{m} \sum_{j=1}^{m} \bigl| \hat r^{(\nu)}(x_j) - \bar r^{(\nu)} \bigr|} \qquad (24)$$

is used, where $\hat r^{(\nu)}(x)$ denotes the "true" estimate,
$\tilde r^{(\nu)}(x)$ is the result of an algorithm, and $\bar r^{(\nu)}$
is the mean of true estimates. Also the following mean distance

$$\text{mdist} = \frac{\sum_{j=1}^{m} \bigl| \hat r^{(\nu)}(x_j) - \tilde r^{(\nu)}(x_j) \bigr|}{\sum_{j=1}^{m} \bigl| \hat r^{(\nu)}(x_j) - \bar r^{(\nu)} \bigr|} \qquad (25)$$

is used. The weaker criterion "mdist" may be relevant in those cases
where only a smooth functional of $r^{(\nu)}$ is of interest, as is often
the case for $\nu = 3, 4$.
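The two distance criteria can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and argument names are hypothetical, the denominator of rdist is taken to be the mean absolute deviation of the true estimates from their mean, and how the "true" estimates are obtained is left open here.

```python
def rdist(true_est, computed):
    """Maximal absolute error relative to the mean absolute deviation of the true estimates."""
    m = len(true_est)
    mean_true = sum(true_est) / m
    spread = sum(abs(t - mean_true) for t in true_est) / m
    return max(abs(t - c) for t, c in zip(true_est, computed)) / spread


def mdist(true_est, computed):
    """Summed absolute error relative to the summed absolute deviation of the true estimates."""
    mean_true = sum(true_est) / len(true_est)
    total_error = sum(abs(t - c) for t, c in zip(true_est, computed))
    return total_error / sum(abs(t - mean_true) for t in true_est)
```

Since the mean of the absolute errors never exceeds their maximum, `mdist` is never larger than `rdist` for the same inputs, which is the sense in which it is the weaker criterion: a single isolated error inflates rdist but is averaged out in mdist.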

When inspecting stability across many situations, problems typically
arise for small bandwidths. This result should be kept in mind when
judging tables 1 and 2, which give the maximum numerical error across
bandwidths $h$ for supremum and for mean distance.

Table 1 shows the maximal relative numerical error rdist of the
algorithms over $h$ = 0.001, 0.005, 0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5
for $n = 1000$ design points and $m = 1000$ equidistant output grid
points. The regression functions are polynomials of order
$p = \nu + 1$. The data are exact, without random errors. The function
$r^{(\nu)}$ to be estimated is always the straight line from $-1$ to $1$.

Table 2 gives the maximal mean numerical error mdist
over h in the same situation. The numerical accuracy is
good to very good for $\nu$ ranging from 0 to 3. For $\nu = 4$
the fast algorithm may break down in terms of maximal
numerical error rdist, but is still useful in terms of
mdist. This shows that there are only isolated problems
with numerical accuracy, and this has been confirmed
graphically.

For  p  =  the  precision  of  the  superstable  and  fast 
algorithms is reduced by a factor of about 10. The
conventional algorithm has problems at the boundaries for
higher order polynomials because of the ill-conditioned
normal equations there. The superstable and fast
algorithms, however, even work stably in terms of rdist for
$\nu = 1, \ldots, 4$, $p = 10, 11$, such that $p - \nu$ is odd. As
expected, they are no longer fast then, and one might
in these cases prefer the superstable algorithm from the
beginning. The conclusions were confirmed by data with
random noise and nonpolynomial regression functions.

4.5  Conclusions 

We derived a fast algorithm which is stable over the
whole region of interest, i.e., up to polynomials of order
about 10. The conventional algorithm has stability
problems for very small bandwidths and at the
boundary. The superstable algorithm proved to be more
stable than the conventional one, and is at the same
time much faster. It is attractive that the algorithms
allow fitting of curves as well as derivatives, for both
global and local bandwidth choice.

Abstract

Given random variables $X \in \mathbb{R}^d$ and $Y$ such that
$E[Y \mid X = x] = m(x)$, the average derivative is defined
as $\delta_0 = E[\nabla m(X)]$, i.e., as the expected value of the
gradient of the regression function. Average derivative
estimation has several applications in econometric theory
(Stoker, 1992) and thus it is crucial to have a fast
implementation of this estimator for practical purposes.

We present such an implementation for a variation
known as density-weighted average derivative estimation.
This algorithm is based on the ideas of binning
or Weighted Averaging of Rounded Points (WARPing).
The basic idea of this method is to discretize the original
data into a $d$-variate histogram and to replace the actual
observations in the nonparametric smoothing steps by
the appropriate bin centers. The nonparametric smoothing
steps thus become a (multi-dimensional) convolution
between the (discretized) data and the (discretized)
smoothing kernel.

A Monte-Carlo study demonstrates that with this
binned implementation a substantial reduction in computing
time can be achieved. But it will also become clear
that in higher dimensions the choice of how to bin is
crucial.

1  Introduction 

Average derivative estimation tries to estimate the mean
slope of the conditional mean of the response variable,
i.e., given a response variable $Y$, whose expectation is
assumed to depend on a $d$-dimensional variable $X$ via a
smooth function $m$, the aim of average derivative
estimation is to estimate the average slope of this function.
In other words, if

$$E[Y \mid X = x] = m(x)$$

and $\nabla$ denotes the gradient (the vector of partial derivatives
with respect to the coordinates of $X$), the aim is to estimate

$$\delta_0 = E[\nabla m(X)] \tag{1}$$

or, respectively, a weighted version

$$\delta_w = E[\nabla m(X)\, w(X)] \tag{2}$$

where $w(\cdot)$ is a non-negative weight function. If we
choose as weight function $w(x) = f(x)$, the marginal
density of $X$, our estimand becomes:

$$\delta = E[\nabla m(X)\, f(X)] = -2E[Y\, \nabla f(X)], \tag{3}$$

where the second equality in (3) follows by partial integration.
The problem of estimating the density-weighted average
derivative, as given by (3), was studied by Powell, Stock and
Stoker (1989).
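
The step behind (3) can be written out as a short worked derivation. This is only a sketch: it assumes that $f$ vanishes on the boundary of its support, so that the boundary term of the partial integration drops out. That assumption is not stated in the text above.

```latex
\begin{aligned}
E[\nabla m(X)\, f(X)]
  &= \int \nabla m(x)\, f(x)^2 \, dx \\
  &= -\int m(x)\, \nabla\{f(x)^2\} \, dx
     && \text{(partial integration, boundary term zero)} \\
  &= -2 \int m(x)\, \nabla f(x)\, f(x) \, dx \\
  &= -2\, E[m(X)\, \nabla f(X)]
   = -2\, E[Y\, \nabla f(X)]
     && \text{(since } E[Y \mid X] = m(X)\text{)}.
\end{aligned}
```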

Average derivative estimation can be used in many
econometric models (Stoker, 1992; Härdle, Hildenbrand
and Jerison, 1991). As one example, we want to mention
single-index models (also called one-term projection
pursuit models). In these models the regression function $m$
has the form

$$m(x) = g(x^T \beta), \tag{4}$$

where $g$ is an unknown univariate function and $\beta$ is a
$d$-dimensional (projection) vector. Stoker (1986) gives
an extensive discussion and motivation for models of the
form (4). The semiparametric model (4) covers a broad
range of important parametric models such as probit and
logit models, censored regression, Tobit models etc.

It is easy to see that in this case we have

$$\nabla m(x) = g'(x^T \beta)\, \beta$$

and thus

$$\delta_0 = E[g'(X^T \beta)]\, \beta \quad \text{and} \quad \delta_w = E[g'(X^T \beta)\, w(X)]\, \beta.$$

This means that (weighted) average derivative estimation
allows us to estimate the unknown projection up
to a scale constant. This is in fact the best we can do
in the semiparametric single-index model given by (4).
If the pair $(g, \beta)$ fulfills model (4) then for any $c \in \mathbb{R}$,
$c \neq 0$, the pair $(\tilde g, \tilde\beta)$ with

$$\tilde g(\cdot) = g(\cdot / c) \quad \text{and} \quad \tilde\beta = c\beta$$

does so too.

The rest of this article is structured as follows.
Section 2 describes the density-weighted average derivative
estimator as proposed by Powell et al. (1989). In
Section 3 we propose how to implement this estimator
using binning ideas and thus achieve considerable
run-time gains. Finally, in Section 4 we discuss some
further points related to the binning method.


2  Direct  implementation 


2.1 Estimator for $\delta$


To estimate the density-weighted average derivative $\delta$,
Powell et al. (1989) propose to estimate the gradient of
the marginal density of the $X$ variables nonparametrically
at each observation point by, say, $\nabla \hat f_h(x_i)$. Their
estimator for $\delta$ is

$$\hat\delta = -\frac{2}{n} \sum_{i=1}^{n} y_i\, \nabla \hat f_h(x_i), \tag{5}$$

which can be motivated as a method of moments estimator
in which the unknown function $\nabla f$ is replaced by a
nonparametric estimate of it.

To estimate the gradient of $f$ nonparametrically, Powell
et al. (1989) use the gradient of a multivariate kernel
density estimator (Silverman, 1986; Scott, 1992). Given a
$d$-variate kernel $\mathcal{K}$ (think of $\mathcal{K}$ as a $d$-variate density
function) and a $d \times d$ positive definite matrix $H$ of smoothing
parameters, a nonparametric estimate of the marginal
density $f$ at a point $x \in \mathbb{R}^d$ would be

$$\hat f_H(x) = \frac{1}{n \det(H)} \sum_{i=1}^{n} \mathcal{K}\{H^{-1}(x - x_i)\}. \tag{6}$$

For numerical ease, a common choice is to take $\mathcal{K}$
as a product of $d$ univariate kernels $K$, and to reduce
$H$ to a diagonal matrix, so that we have only a
$d$-dimensional vector $h$ of smoothing parameters. Wand
and Jones (1993) discuss the implications of this
simplification for the two-dimensional case. With these choices
(6) simplifies to

$$\hat f_h(x) = \frac{1}{n} \sum_{j=1}^{n} \prod_{k=1}^{d} \frac{1}{h_k} K\left(\frac{x_k - x_{jk}}{h_k}\right), \tag{7}$$

where $x = (x_1, \ldots, x_d)^T$ and $x_j = (x_{j1}, \ldots, x_{jd})^T$.
Powell et al. (1989) do not use the nonparametric density
estimator given in (7) directly, but a leave-one-out
version of it. (For this reason the estimator $\hat\delta$ has a
U-statistic structure and can be easily analyzed.) Thus to
estimate the marginal density $f$ at the observation $x_i$,
they drop $x_i$ from the sample and calculate $\hat f_h(x_i)$ from
the remaining sample (of size $n - 1$). As a further
simplification they use only one bandwidth for all dimensions.
So the estimator $\nabla \hat f_h(x_i)$ which they use in (5) is:

$$\nabla \hat f_h(x_i) = \frac{1}{n-1} \sum_{j=1,\, j \neq i}^{n} \nabla \mathcal{K}_h(x_i - x_j), \qquad \mathcal{K}_h(u) = \prod_{k=1}^{d} K_h(u_k), \tag{8}$$

with $K_h(u) = K(u/h)/h$.
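
A direct implementation of (5) with the leave-one-out gradient can be sketched as below. This is only an illustration under stated assumptions: the quartic kernel is an assumed choice of $K$ (the text does not fix one), and the function names are hypothetical. The double loop over all pairs of observations is what makes the direct approach expensive.

```python
def quartic(u):
    """Quartic kernel on [-1, 1] (an assumed choice of K)."""
    return 15.0 / 16.0 * (1.0 - u * u) ** 2 if abs(u) <= 1.0 else 0.0


def quartic_deriv(u):
    """Derivative K'(u) of the quartic kernel."""
    return -15.0 / 4.0 * u * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def grad_product_kernel(u, h):
    """Gradient of prod_k K_h(u_k), where K_h(v) = K(v / h) / h."""
    d = len(u)
    k = [quartic(v / h) / h for v in u]
    dk = [quartic_deriv(v / h) / h ** 2 for v in u]
    grad = []
    for m in range(d):
        g = dk[m]
        for j in range(d):
            if j != m:
                g *= k[j]
        grad.append(g)
    return grad


def ade_direct(xs, ys, h):
    """Density-weighted average derivative estimate, O(n^2) pairs.

    xs is a list of d-dimensional points, ys the list of responses.
    """
    n, d = len(xs), len(xs[0])
    total = [0.0] * d
    for i in range(n):
        for j in range(n):
            if j == i:          # leave-one-out: skip the observation itself
                continue
            u = [a - b for a, b in zip(xs[i], xs[j])]
            g = grad_product_kernel(u, h)
            for m in range(d):
                total[m] += ys[i] * g[m]
    # -(2/n) * sum_i y_i * (1/(n-1)) * sum_{j != i} grad K_h(x_i - x_j)
    return [-2.0 * t / (n * (n - 1)) for t in total]
```

Because the kernel derivative is an odd function, the pairwise gradient terms cancel when summed over all pairs, so adding a constant to every response leaves the estimate unchanged.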


2.2  Asymptotic  properties 

Powell et al. (1989) showed that under certain regularity
conditions and a suitable choice for $K$ and the rate with
which $h$ tends to zero, the estimator $\hat\delta$ given in (5) is
consistent and has an asymptotic normal distribution.
More specifically they proved that

$$\sqrt{n}\,(\hat\delta - \delta) \xrightarrow{\mathcal{L}} N(0, \Sigma),$$

where

$$\Sigma = 4E[r(X,Y)\, r(X,Y)^T] - 4\delta\delta^T,$$

$$r(x, y) = f(x)\nabla m(x) - \{y - m(x)\}\nabla f(x).$$


2.3  Estimator  for  the  variance 

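To estimate the asymptotic variance $\Sigma$ of $\hat\delta$, Powell et
al. (1989) propose to estimate $r(x_i, y_i)$ by

$$\hat r(x_i, y_i) = -\frac{1}{n-1} \sum_{j=1,\, j \neq i}^{n} \nabla \mathcal{K}_h(x_i - x_j)\,(y_i - y_j), \tag{9}$$

where $\nabla \mathcal{K}_h$ denotes the gradient of the product kernel
with bandwidth $h$, and thus $\Sigma$ by

$$\hat\Sigma = \frac{4}{n} \sum_{i=1}^{n} \hat r(x_i, y_i)\, \hat r(x_i, y_i)^T - 4\hat\delta\hat\delta^T. \tag{10}$$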
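
For a scalar regressor ($d = 1$) the estimators (5), (9) and (10) can be sketched together as follows. The quartic kernel and the function names are assumptions made for this illustration only. The sign convention for $\hat r$ follows the definition of $r(x, y)$ in Section 2.2, but the sign does not affect $\hat\Sigma$.

```python
def quartic_deriv(u):
    """Derivative of the quartic kernel on [-1, 1] (assumed kernel)."""
    return -15.0 / 4.0 * u * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def ade_with_variance(xs, ys, h):
    """Return (delta_hat, sigma_hat) for scalar x.

    delta_hat follows (5), r_hat follows (9), sigma_hat follows (10).
    """
    n = len(xs)
    grad_f = []   # leave-one-out derivative of the density estimate
    r_hat = []    # estimates of r(x_i, y_i)
    for i in range(n):
        g_sum = 0.0
        r_sum = 0.0
        for j in range(n):
            if j == i:
                continue
            dk = quartic_deriv((xs[i] - xs[j]) / h) / h ** 2  # K_h'(x_i - x_j)
            g_sum += dk
            r_sum += dk * (ys[i] - ys[j])
        grad_f.append(g_sum / (n - 1))
        r_hat.append(-r_sum / (n - 1))
    delta_hat = -2.0 / n * sum(y * g for y, g in zip(ys, grad_f))
    sigma_hat = 4.0 / n * sum(r * r for r in r_hat) - 4.0 * delta_hat ** 2
    return delta_hat, sigma_hat
```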

In the next section we will discuss how fast implementations
for $\hat\delta$ and $\hat\Sigma$ can be obtained by using binning
techniques.

3 Binned implementation

3.1  Basic  idea 

The basic idea of binning methods is to replace each
observation $x_i$ by the nearest point $b_z$ from a regularly
spaced grid. To fix ideas consider kernel density estimation
in the one-dimensional case,

$$\hat f_h(x) = \frac{1}{n} \sum_{i=1}^{n} K_h(x - x_i), \tag{11}$$

and take the regular grid $\{b_z : b_z = z\Delta,\ z \in \mathbb{Z}\}$ where $\Delta$
is a fixed constant, the binwidth. Replacing now each $x_i$
in (11) by the nearest $b_z$, we see that we have to evaluate
the kernel $K$ only at integer multiples of $\Delta/h$:

$$w_l = \frac{1}{h} K\left(\frac{l\Delta}{h}\right), \qquad l = -L, \ldots, L. \tag{12}$$

Here $L$ is chosen such that $\Delta L/h \geq 1$ if $K$ has compact
support on $[-1, 1]$ (if $K$ is the Gaussian kernel, i.e.,
the kernel has no compact support, Wand (1993)
recommends $\Delta L/h \geq 4$). If we denote further by $n_z$ the
number of observations $x_i$ which have $b_z$ as their nearest
point in the grid, we see that we can approximate (11)
by (let $b_z$ be the point nearest to $x_i$):

$$\hat f_h(x_i) \approx \frac{1}{n} \sum_{j=1}^{n} \sum_{l \in \mathbb{Z}} w_{z-l}\, 1\{b_l \text{ is nearest to } x_j\} = \frac{1}{n} \sum_{l \in \mathbb{Z}} w_{z-l}\, n_l,$$

with $w_{z-l} = 0$ for $|z - l| > L$.

The last formula is a discrete convolution between the
vector of weights $w_l$ (the discretized kernel) and the vector
of bin counts $n_z$ (the discretized data).
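
The one-dimensional binning step can be sketched in a few lines. This is a minimal illustration, assuming the quartic kernel (so that the rule $\Delta L/h \geq 1$ applies) and hypothetical function names. It evaluates the kernel only $2L + 1$ times and computes the estimate only at grid points with a non-zero bin count.

```python
import math
from collections import Counter


def quartic(u):
    """Quartic kernel with compact support on [-1, 1] (assumed kernel)."""
    return 15.0 / 16.0 * (1.0 - u * u) ** 2 if abs(u) <= 1.0 else 0.0


def direct_kde(xs, x, h):
    """Direct evaluation of the kernel density estimate (11) at x."""
    return sum(quartic((x - xi) / h) for xi in xs) / (len(xs) * h)


def binned_kde(xs, h, delta):
    """Binned estimate: returns {z: estimate at the grid point z * delta}."""
    n = len(xs)
    counts = Counter(round(xi / delta) for xi in xs)        # bin counts n_z
    big_l = math.ceil(h / delta)                            # delta * L / h >= 1
    w = {l: quartic(l * delta / h) / h                      # weights as in (12)
         for l in range(-big_l, big_l + 1)}
    est = {}
    for z in counts:                                        # only n_z != 0
        est[z] = sum(w[l] * counts.get(z - l, 0)
                     for l in range(-big_l, big_l + 1)) / n
    return est
```

If all observations lie exactly on the grid, the binned and the direct estimate coincide, which is a convenient sanity check for an implementation.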

Silverman (1982) uses a fast Fourier transform
to calculate this discrete convolution. Another algorithm
which does not use the fast Fourier transform is
given in Scott (1985) (see also Härdle and Scott, 1992;
Härdle, 1991). Fan and Marron (1994) describe how to
use these ideas for other nonparametric curve smoothers.

Fan and Marron (1994) also quantify the run-time
gains achievable using these ideas. These run-time gains
are mainly due to two facts. First, we have far fewer
kernel evaluations; in fact we have to evaluate the kernel
only once on a finite grid of points. Secondly, once
the data are discretized the nonparametric curve smoother
is estimated at the grid points $b_z$ and not at the original
observations $x_i$. Usually the number of grid points
at which the smoother is evaluated is (much) smaller
than $n$. The estimate at an original observation $x_i$ is
either taken as the estimate at the nearest $b_z$ or obtained
by linear interpolation between the estimates at the two
nearest grid points (Jones, 1989).

3.2 Application to $\hat\delta$

The ideas presented in Section 3.1 above are readily
extendable to the multivariate case (Wand, 1993) and to
the estimator $\hat\delta$.

Again we define a (multivariate) grid of equidistant
points $b_z \in \mathbb{R}^d$ and replace $x_i \in \mathbb{R}^d$ by the nearest
$b_z$. To fix ideas let $\Delta = (\Delta_1, \ldots, \Delta_d)^T$ be a fixed
$d$-dimensional vector and define $b_z$ by

$$b_z = z\Delta = (z_1\Delta_1, \ldots, z_d\Delta_d)^T$$

for each multi-index $z = (z_1, \ldots, z_d)^T \in \mathbb{Z}^d$. Note the
pointwise multiplication of the vectors $z$ and $\Delta$ above.
In the rest of this article, if not indicated differently, we
mean this kind of pointwise vector multiplication rather
than the standard matrix multiplication when we multiply
two vectors.

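For each $z \in \mathbb{Z}^d$ let again $n_z$ denote the number of
observed $x_i$ for which $b_z$ is the nearest grid point. For a
binned implementation of the estimator $\nabla \hat f_h$ we also need
to discretize the derivative of the kernel $K$:

$$w'_{l,j} = \frac{1}{h^2} K'\left(\frac{l\Delta_j}{h}\right), \qquad l = -L_j, \ldots, L_j, \tag{13}$$

and define $w_{l,j}$ analogous to (12) by replacing $\Delta$ by $\Delta_j$. If
we define now for each multi-index $l = (l_1, \ldots, l_d)^T \in \mathbb{Z}^d$
the corresponding weight $w'_l \in \mathbb{R}^d$ by:

$$w'_l = \begin{pmatrix} w'_{l_1,1}\, w_{l_2,2} \cdots w_{l_d,d} \\ \vdots \\ w_{l_1,1}\, w_{l_2,2} \cdots w'_{l_d,d} \end{pmatrix},$$

we see that analogous to the example in Section 3.1 a
binned version of the estimator $\nabla \hat f_h$ is:

$$\nabla \tilde f_h(b_z) = \frac{1}{n-1} \sum_{l} w'_l\, n_{z-l}. \tag{14}$$

Note that the sum in (14) is actually a sum over $d$ indices
$l_1, \ldots, l_d$, each $l_j$ taking values from $-L_j$ to $L_j$, $j =
1, \ldots, d$. Also, the multi-index $z - l$ in (14) is $z - l =
(z_1 - l_1, \ldots, z_d - l_d)^T$.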

Thus a binned version of the density-weighted average
derivative estimator $\hat\delta$ is:

$$\tilde\delta = -\frac{2}{n} \sum_{z} n_z\, \bar y_z\, \nabla \tilde f_h(b_z), \tag{15}$$


B.  Turlach  3 1 


where $\bar y_z$ is the average over all observations $y_i$ such that
$b_z$ is the nearest grid point to the corresponding $x_i$.
Note that the summation in (15) is actually only over
all $z \in \mathbb{Z}^d$ such that $n_z \neq 0$ and is not an infinite sum.
Furthermore, if we compare (5) with (15) we see that
the only approximation error we make is due to replacing
$\nabla \hat f_h(x_i)$ by $\nabla \tilde f_h(b_z)$. With respect to the $y_i$ we “keep the
full resolution”.
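
The binned estimator described by (14) and (15) can be sketched as follows. The sketch makes several assumptions that are not taken from the text: the quartic kernel, “histogram” binning by rounding to the nearest grid point, a dictionary keyed by multi-index so that only bins with $n_z \neq 0$ are stored and visited, and the scaling $K'(\cdot)/h^2$ for the discretized kernel derivative. All names are hypothetical.

```python
import math
from collections import defaultdict
from itertools import product


def quartic(u):
    """Quartic kernel on [-1, 1] (an assumed choice of K)."""
    return 15.0 / 16.0 * (1.0 - u * u) ** 2 if abs(u) <= 1.0 else 0.0


def quartic_deriv(u):
    """Derivative K'(u) of the quartic kernel."""
    return -15.0 / 4.0 * u * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def ade_binned(xs, ys, h, delta):
    """Binned density-weighted average derivative, following (14) and (15)."""
    n, d = len(xs), len(xs[0])
    count = defaultdict(int)       # bin counts n_z
    ysum = defaultdict(float)      # n_z * ybar_z (sum of y within bin z)
    for x, y in zip(xs, ys):
        z = tuple(round(x[k] / delta[k]) for k in range(d))
        count[z] += 1
        ysum[z] += y
    big_l = [math.ceil(h / delta[k]) for k in range(d)]
    rng = [range(-big_l[k], big_l[k] + 1) for k in range(d)]
    # discretized kernel and kernel derivative for every coordinate
    w = [{l: quartic(l * delta[k] / h) / h for l in rng[k]} for k in range(d)]
    dw = [{l: quartic_deriv(l * delta[k] / h) / h ** 2 for l in rng[k]}
          for k in range(d)]
    offsets = list(product(*rng))
    total = [0.0] * d
    for z in list(count):          # only grid points with n_z != 0
        grad = [0.0] * d
        for l in offsets:
            n_zl = count.get(tuple(a - b for a, b in zip(z, l)), 0)
            if n_zl == 0:
                continue
            for m in range(d):
                g = dw[m][l[m]]
                for j in range(d):
                    if j != m:
                        g *= w[j][l[j]]
                grad[m] += g * n_zl
        for m in range(d):
            total[m] += ysum[z] * grad[m] / (n - 1)
    return [-2.0 * t / n for t in total]
```

The responses enter only through the per-bin sums, which is the sense in which the full resolution in $y$ is kept.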


3.3 Application to $\hat\Sigma$

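In this section we will discuss the implementation of a
binned estimator for the asymptotic variance $\Sigma$ given
in Section 2.2. A naive way of implementing such an
estimator would be to plug into (10) a binned estimate,
say, $\tilde r(b_z)$ for $\hat r(x_i, y_i)$, given in (9), to obtain:

$$\tilde\Sigma = \frac{4}{n} \sum_{z} n_z\, \tilde r(b_z)\, \tilde r(b_z)^T - 4\tilde\delta\tilde\delta^T \tag{16}$$

with $\tilde\delta$ from (15). The binned estimate $\tilde r(b_z)$ is easily
derived in the same way as demonstrated in Section 3.1.
Let $b_z$ be the grid point nearest to $x_i$, then we have:

$$\begin{aligned}
\hat r(x_i, y_i) &= -\frac{1}{n-1} \sum_{j=1,\, j \neq i}^{n} \nabla \mathcal{K}_h(x_i - x_j)\,(y_i - y_j) \\
&\approx -\frac{1}{n-1} \sum_{j=1}^{n} \sum_{l} w'_{z-l}\,(y_i - y_j)\, 1\{b_l \text{ is nearest to } x_j\} \\
&= -\frac{1}{n-1} \sum_{l} w'_{z-l}\, n_l\, (y_i - \bar y_l) = \tilde r(b_z, y_i) \\
&\approx -\frac{1}{n-1} \sum_{l} w'_{z-l}\, n_l\, (\bar y_z - \bar y_l) = \tilde r(b_z).
\end{aligned}$$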


Note that the only approximation error in $\tilde r(b_z, y_i)$ is due
to replacing the $x_j$ by the grid point $b_z$. Thus for $\tilde r(b_z, y_i)$
we still have the full resolution in the $y$-direction. Only
if we go to $\tilde r(b_z)$ do we make an approximation error in that
direction too. The motivation for this approximation is
that if several $x_j$ exist which have $b_z$ as nearest grid point,
then we should average over the corresponding $\tilde r(b_z, y_i)$
to get a unique estimate $\tilde r(b_z)$ at $b_z$.

However, the binned implementation which we get
if we insert $\tilde r(b_z)$ in (16) does not work. The reason
for this is explained and graphically illustrated in
Proença and Turlach (1994). On the one hand we make an
approximation error in the $y$-direction by going from
$\tilde r(b_z, y_i)$ to $\tilde r(b_z)$. On the other hand we want to
approximate $\hat r(x_i, y_i)\hat r(x_i, y_i)^T$, which involves a squared term
in $y$. Thus we have to take into account what Proença
and Turlach (1994) call the within-bin variability of $y$.
This means that we cannot find a binned estimator for
$\hat r(x_i, y_i)\hat r(x_i, y_i)^T$ by finding one just for $\hat r(x_i, y_i)$, but
that we really have to consider this product directly.
Hence a “correct” binned estimator can be found by
observing that:

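$$\begin{aligned}
\hat r(x_i, y_i)\hat r(x_i, y_i)^T &\approx \tilde r(b_z, y_i)\, \tilde r(b_z, y_i)^T \\
&= \left(\frac{1}{n-1}\right)^2 \sum_{l} \sum_{l'} w'_{z-l}\, w'^{T}_{z-l'}\, n_l (y_i - \bar y_l)\, n_{l'} (y_i - \bar y_{l'}) \\
&= \left(\frac{1}{n-1}\right)^2 \sum_{l} \sum_{l'} w'_{z-l}\, w'^{T}_{z-l'}\, n_l (y_i - \bar y_z + \bar y_z - \bar y_l)\, n_{l'} (y_i - \bar y_z + \bar y_z - \bar y_{l'}) \\
&= \tilde r(b_z)\tilde r(b_z)^T + \left(\frac{1}{n-1}\right)^2 \sum_{l} \sum_{l'} w'_{z-l}\, w'^{T}_{z-l'} \left\{ n_l n_{l'} (y_i - \bar y_z)^2 + n_l n_{l'} (y_i - \bar y_z)(2\bar y_z - \bar y_l - \bar y_{l'}) \right\}.
\end{aligned}$$

And thus the sum $\sum_{i=1}^{n} \hat r(x_i, y_i)\hat r(x_i, y_i)^T$ can be
approximated as:

$$\begin{aligned}
\sum_{i=1}^{n} \hat r(x_i, y_i)\hat r(x_i, y_i)^T &\approx \sum_{z \in \mathbb{Z}^d}\ \sum_{i:\, b_z \text{ nearest to } x_i} \tilde r(b_z, y_i)\, \tilde r(b_z, y_i)^T \\
&= \sum_{z \in \mathbb{Z}^d} \left\{ n_z\, \tilde r(b_z)\tilde r(b_z)^T + \left(\frac{1}{n-1}\right)^2 \sum_{l} \sum_{l'} w'_{z-l}\, w'^{T}_{z-l'}\, n_l n_{l'} n_z \left(\overline{y_z^2} - \bar y_z^2\right) \right\} \\
&=: \widetilde{rr^T}.
\end{aligned}$$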

Note that because of the summation over $i$ the term
which includes $(y_i - \bar y_z)(2\bar y_z - \bar y_l - \bar y_{l'})$ drops out, i.e.,
the sum is zero. Also, $\bar y_z^2$ denotes the square of $\bar y_z$ and $\overline{y_z^2}$
denotes the mean of all $y_j^2$ such that $x_j$ has $b_z$ as nearest
grid point. This term, namely $n_z(\overline{y_z^2} - \bar y_z^2)$, measures the
variability of $Y$ around the grid point $b_z$. This term is
obtained by expanding $(y_i - \bar y_z)^2$ and summing over $i$.
Note that if we choose $\Delta$ so small that each grid point $b_z$
has at most one observation $x_i$ for which it is the nearest
point, then all of these within-bin variability terms vanish
and the binned estimator given in (16) would be correct.
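
The scalar identity behind the within-bin variability term can be checked numerically. In the sketch below, `a` and `b` stand in for two fixed bin means $\bar y_l$ and $\bar y_{l'}$, and `ys` for the responses falling into one bin $z$. These names, like the functions themselves, are illustrative only.

```python
def cross_sum_direct(ys, a, b):
    """Sum over the bin of (y_i - a) * (y_i - b), at full y-resolution."""
    return sum((y - a) * (y - b) for y in ys)


def cross_sum_binned(ys, a, b):
    """Split the same sum into the naive binned part and the correction.

    naive      = n_z * (ybar_z - a) * (ybar_z - b)
    correction = n_z * (mean of y^2 - ybar_z^2)   (within-bin variability)
    """
    n_z = len(ys)
    ybar = sum(ys) / n_z
    y2bar = sum(y * y for y in ys) / n_z
    naive = n_z * (ybar - a) * (ybar - b)
    correction = n_z * (y2bar - ybar ** 2)
    return naive, correction
```

The naive part alone misses the correction whenever a bin holds more than one observation with differing responses. With a single observation per bin the correction is zero.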

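However, in general we have to take these terms into
account. Thus a “correct” binned estimator for the
variance matrix is given by

$$\tilde\Sigma = \frac{4}{n}\, \widetilde{rr^T} - 4\tilde\delta\tilde\delta^T,$$

where $\widetilde{rr^T}$ denotes the binned approximation of
$\sum_{i=1}^{n} \hat r(x_i, y_i)\hat r(x_i, y_i)^T$ derived above,
with $\tilde\delta$ from (15).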

4  Closing  remarks 

In the previous section we demonstrated how the simple
and intuitive basic binning idea can be applied to the
density-weighted average derivative estimator $\hat\delta$ and the
estimator of the asymptotic covariance matrix $\Sigma$. Some
questions still remain which we would like to address
here.

From (14) we see that $\nabla \tilde f_h(b_z)$ is a discrete convolution;
the same is true for $\tilde r(b_z)$ and $\widetilde{rr^T}$. How should we
calculate this discrete convolution? As mentioned above,
Silverman (1982) and Wand (1993) use a fast Fourier
transform. However, this method is inappropriate
in our situation since we are only interested in calculating
these estimates at the points $b_z$ which have some
observation close enough to them, i.e., for which $n_z \neq 0$.
But a fast Fourier transform method would calculate
these estimates at all grid points $b_z$. Just imagine
the case where we have a two-dimensional $X$-variable
and we choose our grid such that we have 100 different
grid points in each dimension. The complete grid will
have 10,000 points $b_z$. In this case a fast Fourier
transform method would calculate $\nabla \tilde f_h(b_z), \ldots$ at all these grid
points. Clearly this involves many unnecessary calculations
if the sample size is not too big.

The fast Fourier transform approach is feasible if we
need estimates at all grid points, for example if we want
to make a plot. But it is also not clear whether the fast Fourier
transform is the fastest method in such a case. Fan and
Marron (1994) find that this approach is not the fastest
for the one-dimensional case whereas Wand (1993) favors
the fast Fourier transform in the two-dimensional case.
Scott (1992) describes alternative algorithms which do
not use a fast Fourier transform. These algorithms step
through all grid points $b_z$ with $n_z \neq 0$ and just do the
necessary calculations at these points and in the
neighborhood of $b_z$ (as defined by the $L_j$), i.e., these
algorithms also calculate the estimates on the whole grid. For
the discrete convolution necessary here we recommend
using specialized versions of the algorithms of Scott (1994)
which step through all grid points $b_z$ with $n_z \neq 0$ and
do the necessary calculations only at these points.

Closely related to the question “How to perform
the discrete convolution?” is the question “How shall
one discretize the data?”. Until now we always used
a kind of “histogram” binning in which $n_z$ was an integer
and each observation was shifted to (replaced by) the
nearest grid point $b_z$. For one-dimensional density
estimation Jones and Lotwick (1984) proposed an
alternative called “linear” binning. In this variation the $n_z$
are no longer integers and each observation is distributed
onto the two nearest grid points. Hall and Wand (1993)
propose further variations for the binning procedure and
quantify the error which is introduced by using binning
techniques (see also Gonzalez-Manteiga, Sanches-Sellero
and Wand, 1994).

But the use of such techniques in a higher-dimensional
setting is problematic. A binning technique like “linear”
binning, which distributes each observation in one
dimension onto two grid points, will distribute each
observation in $d$ dimensions onto $2^d$ grid points. This could
have the effect that we have more grid points $b_z$ with
$n_z \neq 0$ than observations! Take for example a
two-dimensional standard normal variable and use linear
binning with a grid where $\Delta = (0.03, 0.03)^T$. If the sample
size is $n = 250$ we have on average 950 grid points $b_z$
at which $n_z \neq 0$. The result of this is that, even if we use
the algorithms described above for the discrete convolution,
the binned implementation using “linear” binning
is slower than the direct implementation.
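
The growth in the number of occupied grid points under “linear” binning can be illustrated by simply counting them. The sketch below is an assumption-laden illustration: the function names are hypothetical, the sample is drawn with Python's `random` module, and only the set of grid points receiving positive weight is tracked, not the weights themselves.

```python
import math
import random
from itertools import product


def sample_normal(n, d, seed):
    """Draw n points from a d-dimensional standard normal distribution."""
    rng = random.Random(seed)
    return [tuple(rng.gauss(0.0, 1.0) for _ in range(d)) for _ in range(n)]


def histogram_bins(points, delta):
    """Grid points with n_z != 0 under "histogram" (nearest point) binning."""
    return {tuple(round(c / dk) for c, dk in zip(p, delta)) for p in points}


def linear_bins(points, delta):
    """Grid points receiving positive weight under "linear" binning.

    Each observation is spread over the 2^d corners of its grid cell.
    """
    touched = set()
    for p in points:
        lower = [math.floor(c / dk) for c, dk in zip(p, delta)]
        frac = [c / dk - lo for c, dk, lo in zip(p, delta, lower)]
        for corner in product((0, 1), repeat=len(p)):
            weight = 1.0
            for bit, f in zip(corner, frac):
                weight *= f if bit else 1.0 - f
            if weight > 0.0:
                touched.add(tuple(lo + bit for lo, bit in zip(lower, corner)))
    return touched
```

For example, `len(linear_bins(sample_normal(250, 2, 0), (0.03, 0.03)))` counts the occupied grid points for one simulated sample of the size discussed above. Histogram binning can never occupy more than $n$ grid points, whereas linear binning can occupy up to $2^d n$.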

This was verified in a Monte-Carlo study with a
bivariate $X$-variable (and $Y$ generated according to a
linear model and a probit model). Using the adapted
algorithms from Scott (1992) for the discrete convolution and
“linear” binning, hardly any run-time gains were observed,
and for a grid with small $\Delta$ the direct implementation
was even faster. If “histogram” binning was used,
however, we observed run-time gains of a factor of 10 over the
direct implementation.

Thus we recommend using “histogram” binning and
the (adapted) algorithms of Scott (1993) for functional
estimation in higher dimensions.

Abstract

A set of Fortran programs has been developed to
obtain (co)variance estimates for multiple trait genetic
analyses with different models for each trait using the sparse
matrix package SPARSPAK, and a derivative-free algorithm
to obtain REML estimates (MTDFREML). A typical
analysis would include birth weight of all animals, weaning
weight and yearling weight on those surviving. The model
would include direct genetic and correlated maternal genetic
effects for each animal and uncorrelated maternal
environmental effects (a total of 33 (co)variance
components) as well as other fixed or random effects
associated with the traits. The simplex algorithm is used to
search for components to minimize -2 log likelihood =
FVALUE. The FVALUE for equations of order 60,000 or
more can be evaluated on personal computers for each of the
potentially thousands of rounds needed to obtain REML
estimates. Efficiency depends on density of the mixed
model equations. Nongenetic models are usually much more
sparse than genetic models that incorporate numerator
relationships among the animals. Scaling of variables is
sometimes a problem due to rounding in calculation of
FVALUE; e.g., multiplying categorical variables by 100 led
to successful convergence. The search algorithm is stopped
when variance for FVALUEs in the simplex is from 10'^ to
10*^, often at a local minimum. With multiple trait analyses,
several restarts may be needed to find the global maximum.
An evolving strategy is:

1. begin with only variances included to minimum local
convergence.

2. restart with covariances included to minimum local
convergence until FVALUE change is no more than
a unit.

3. restart with maximum local convergence (10‘^ to
10'^) until FVALUE change is only at second or
third decimal when global maximum is declared.

Successful  analyses  with  MTDFREML  require  "art"  as  well 


Introduction 

Restricted maximum likelihood (Patterson and
Thompson, 1972) has become the preferred method of
animal breeders to estimate (co)variance matrices among and
within traits described by mixed linear models. The
traditional algorithms make use of identities based on
Henderson's (e.g., 1963, 1984) mixed model equations which
have computational advantages including being based on a
simple modification of least squares equations. Algorithms
based on derivatives of the multivariate normal likelihood
given the data have been limited in scope by requiring
inverse elements of the coefficient matrix of the mixed
model equations. For practical purposes, that has meant
mixed model equations with order in the range of 1000 to
5000.

Derivative-free algorithms that take advantage of the
sparsity of the coefficient matrix have greatly expanded the
number of equations that can be managed to the order of
50,000 to 150,000. The purpose of this note is to outline
briefly the science of DFREML and then to discuss some
aspects of the "art" of DFREML, as the numerical properties
are not well understood, at least by most animal breeders.

The  Science  of  DFREML 

The original algorithm for DFREML as developed in
animal breeding traces to several sources including the
realization that Gaussian elimination of augmented least
squares (although in this case, mixed model) equations can
be used to obtain the two computing intensive parts of the
log likelihood (Smith and Graser, 1986; Graser, Smith and
Tier, 1987) as the keynote speaker for this conference
described (Stewart, 1994). The other two developments were
Henderson's mixed model equations (e.g., 1963) and the
discovery that the log of the likelihood can be written in
terms of four components of the mixed model equations
(Harville, 1977; Searle, 1979).

The general linear model in typical animal breeding
notation is:

$$y = X\beta + Zu + e, \qquad E[y] = X\beta,$$

$$\mathrm{Var}\begin{pmatrix} u \\ e \end{pmatrix} = \begin{pmatrix} G & 0 \\ 0 & R \end{pmatrix}, \qquad V(y) = V = ZGZ' + R,$$

where $y$ is the vector of observations; $\beta$ is the vector of
fixed effects with association matrix, $X$; $u$ is the vector of
random effects with association matrix, $Z$, and (co)variance
matrix, $G$; and $e$ is the vector of residuals associated with
the observations with (co)variance matrix, $R$.

Henderson (e.g., 1984) showed that solutions to mixed
model equations provide best linear unbiased estimators of
estimable functions of fixed effects and best linear unbiased
predictors of realized values of random effects.

Henderson's mixed model equations (MME) are:

$$\begin{pmatrix} X'R^{-1}X & X'R^{-1}Z \\ Z'R^{-1}X & Z'R^{-1}Z + G^{-1} \end{pmatrix} \begin{pmatrix} \hat\beta \\ \hat u \end{pmatrix} = \begin{pmatrix} X'R^{-1}y \\ Z'R^{-1}y \end{pmatrix}.$$

In simpler notation: $C s = r$.

Note that except for the usual zero covariance between
the $u$ and $e$ vectors, the mixed model equations are
completely general and can encompass multiple traits,
missing observations on some traits of some animals,
different models for different traits and, for animal breeders,
relationships among animals due to genes in common, $A$,
and genetic covariances among traits, $G_0$.

Typical random factors in animal breeding models
include animal's direct genetic value, mother's maternal
genetic value (with genetic covariance between direct and
maternal genetic values), animal permanent environmental
effects when animals have repeated records, and maternal
permanent environmental effects when mothers have more
than one progeny with records. Other genetic models used
by animal breeders may include, instead of animal effects,
sire transmitting ability (1/2 direct genetic value of sire),
maternal grandsire effect and dam permanent environmental
effect. Other variations are also used.

The large number of variances and covariances to
estimate from multiple trait models can be illustrated for
traits with a direct and maternal genetic value (with
covariance) and two other random factors such as dam
permanent environmental and a litter effect in addition to
residual effects. A single trait analysis will involve five
variances and one covariance. A two-trait analysis will
involve those six elements twice plus seven other
covariances. A three-trait analysis would have 6 + 6 + 6 +
7 + 7 + 7 = 39 variance and covariance components to
estimate.
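
The counting argument can be written as a one-line formula: six components per trait plus seven covariances for every pair of traits. The helper below is only a sketch with a hypothetical name, and its default values are specific to the model described in this paragraph.

```python
def n_components(n_traits, per_trait=6, per_pair=7):
    """Number of (co)variance components for the model described above.

    per_trait: five variances and one covariance within each trait.
    per_pair:  covariances between each pair of traits.
    """
    n_pairs = n_traits * (n_traits - 1) // 2
    return per_trait * n_traits + per_pair * n_pairs
```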

Harville (1977) and Searle (1979) showed that the log of the
multivariate normal likelihood given the data is:

$$\Lambda = -0.5\,[\text{constant} + \log|R| + \log|G| + \log|C| + y'Py]$$

where $C$ = coefficient matrix for MME and

$$P = V^{-1} - V^{-1}X(X'V^{-1}X)^{-1}X'V^{-1}.$$

Note that $C$ and $P$ depend on $R$ and $G$ as well as on $X$ and
$Z$.

Derivative-Free  Algorithms 

Derivative-free algorithms for REML are based on
searching for the combination of individual variances and
covariances associated with $R$ and $G$ that will maximize $\Lambda$
or, more usually, will minimize FVALUE = $-2\Lambda$. The
original algorithm of Smith and Graser (1986) and that used
in the single trait program of Meyer (1988), which
popularized use of DFREML, was based on sparse matrix
Gaussian elimination of $C$ augmented with $r$, with the total
sum of squares in the corresponding diagonal. Gaussian
elimination automatically produced a known multiple of $y'Py$
and $\log|C|$, the difficult-to-compute terms in $\Lambda$. The
simplex algorithm (Nelder and Mead, 1965) is the usual
choice to search for (co)variances to minimize $-2\Lambda$.
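
The search strategy can be sketched with a small simplex routine. This is a simplified Nelder-Mead variant written for illustration, not the MTDFREML code: it stops when the variance of the function values in the simplex falls below a tolerance and is then restarted from the best point. The objective here is any user-supplied function standing in for FVALUE; evaluating the real $-2\Lambda$ is not shown. The names, the step size and the tolerances are all assumptions.

```python
def simplex_search(f, start, step=0.5, tol=1e-12, max_iter=2000):
    """Minimize f by a simplified Nelder-Mead simplex search.

    Stops when the variance of the function values in the simplex < tol.
    """
    n = len(start)
    pts = [list(start)] + [
        [s + (step if k == i else 0.0) for k, s in enumerate(start)]
        for i in range(n)
    ]
    vals = [f(p) for p in pts]
    for _ in range(max_iter):
        order = sorted(range(n + 1), key=vals.__getitem__)
        pts = [pts[i] for i in order]
        vals = [vals[i] for i in order]
        mean = sum(vals) / (n + 1)
        if sum((v - mean) ** 2 for v in vals) / n < tol:
            break
        worst = pts[-1]
        centroid = [sum(p[k] for p in pts[:-1]) / n for k in range(n)]
        refl = [c + (c - w) for c, w in zip(centroid, worst)]
        f_refl = f(refl)
        if f_refl < vals[0]:                      # try to expand further
            expd = [c + 2.0 * (c - w) for c, w in zip(centroid, worst)]
            f_expd = f(expd)
            if f_expd < f_refl:
                pts[-1], vals[-1] = expd, f_expd
            else:
                pts[-1], vals[-1] = refl, f_refl
        elif f_refl < vals[-2]:                   # accept the reflection
            pts[-1], vals[-1] = refl, f_refl
        else:                                     # contract towards centroid
            cont = [c + 0.5 * (w - c) for c, w in zip(centroid, worst)]
            f_cont = f(cont)
            if f_cont < vals[-1]:
                pts[-1], vals[-1] = cont, f_cont
            else:                                 # shrink towards best point
                best = pts[0]
                pts = [best] + [
                    [b + 0.5 * (c - b) for b, c in zip(best, p)]
                    for p in pts[1:]
                ]
                vals = [vals[0]] + [f(p) for p in pts[1:]]
    i_best = min(range(n + 1), key=vals.__getitem__)
    return pts[i_best], vals[i_best]


def search_with_restarts(f, start, tol=1e-12, max_restarts=20, min_gain=1e-8):
    """Restart the simplex from the best point until f barely changes."""
    best, f_best = simplex_search(f, start, tol=tol)
    for _ in range(max_restarts):
        cand, f_cand = simplex_search(f, best, tol=tol)
        gain = f_best - f_cand
        if f_cand < f_best:
            best, f_best = cand, f_cand
        if gain < min_gain:
            break
    return best, f_best
```

Restarting with a fresh simplex around the current best point guards against a simplex that has collapsed before reaching the optimum. It does not guarantee that a global optimum is found.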

Boldman and Van Vleck (1991) used subroutines in
SPARSPAK (George, et al., 1980; Chu, et al., 1984) to
decrease the time to calculate $-2\Lambda$ by factors of 100 to 600
from the times required by the original algorithm of Meyer
(1988). SPARSPAK is based on Choleski factorization rather
than Gaussian elimination and provides a more general form
for calculation of $y'Py$ as well as $\log|C|$. Both the
Gaussian and Choleski based algorithms lead to general
programs which are not model dependent, whereas
derivative based algorithms are more difficult to generalize
because of the requirements to calculate a quadratic in $y$ for
each (co)variance component and to calculate the expectation
of the quadratic, which is a function of corresponding
elements of the inverse of the coefficient matrix.

Some general observations are that derivative based
algorithms are slow to converge, that single trait DFREML
converges quickly but that multi-trait analyses may converge
slowly with DFREML. Many restarts may be needed if
covariances are estimated (Press, et al., 1989; Groeneveld
and Kovak, 1990; Boldman and Van Vleck, 1990).

The Choleski based algorithm used for the
MTDFREML package (Boldman, et al., 1993) consists of
two basic steps:

1.  a  method  (the  simplex  algorithm)  to  search  for 
parameter  estimates  to  minimize  -2A  and 

2.  formation  and  solution  of  MME  for  parameter 
estimates  chosen  by  the  simplex  algorithm  by  use 
of  SPARSPAK  subroutines  to  take  advantage  of 
the  usual  sparsity  of  the  mixed  model  equations. 

Their package also includes a program to calculate the
inverse of the relationship matrix among the animals to be
used in forming the mixed model equations (Quaas, 1976)
and a preparation program which recodes animal
identification and fixed effect levels into equation numbers.

Calculation of -2A

If y_i is the vector of observations on traits measured
on animal i, then the residual covariance matrix for animal
i is R_i. For the usual assumption that residuals from one
animal to another are uncorrelated,
log |R| = Σ_i log |R_i|, where each R_i is dependent on
the number of traits measured on animal i. All eigenvalues
of R_0, the R_i of maximum order, must be positive.
Thus, one way to calculate log |R| is to calculate the sum
of logarithms of eigenvalues for each type of R_i,
multiply by the number of R_i of each type, and then sum
over all types of R_i. The log |G| can be calculated
similarly and even more easily (e.g., Meyer, 1989, 1991).
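
The bookkeeping for log |R| can be sketched as follows. This is only an illustration, not the MTDFREML code: the function names, the two-trait R_0, and the counts per pattern of measured traits are hypothetical, and the log determinant is taken from a Choleski factor instead of from eigenvalues (the two agree for a positive definite matrix).

```python
import math

def log_det_spd(a):
    """Log determinant of a symmetric positive definite matrix via Choleski."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    total = 0.0
    for j in range(n):
        d = a[j][j] - sum(l[j][k] ** 2 for k in range(j))
        if d <= 0.0:
            raise ValueError("matrix is not positive definite")
        l[j][j] = math.sqrt(d)
        total += math.log(d)  # log(d) equals 2 * log(l[j][j])
        for i in range(j + 1, n):
            l[i][j] = (a[i][j] - sum(l[i][k] * l[j][k] for k in range(j))) / l[j][j]
    return total

def log_det_r(r0, type_counts):
    """Sum count * log|R_type| over the types of R_i.

    type_counts maps a tuple of measured trait indices to the number of
    animals with that pattern (a hypothetical input layout).
    """
    total = 0.0
    for traits, count in type_counts.items():
        sub = [[r0[i][j] for j in traits] for i in traits]
        total += count * log_det_spd(sub)
    return total

# Hypothetical example: 100 animals with both traits, 20 with only
# trait 1, and 5 with only trait 2.
r0 = [[4.0, 1.0], [1.0, 9.0]]
counts = {(0, 1): 100, (0,): 20, (1,): 5}
log_r = log_det_r(r0, counts)
```

Only one small determinant per type is needed, however many animals share that type.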

For example, if

\[
G = \begin{pmatrix}
A \otimes G_0 & 0 & \cdots & 0 \\
0 & I_{n_1} \otimes C_{11} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & I_{n_L} \otimes C_{LL}
\end{pmatrix}
\]

then

\[
\log|G| = t \log|A| + q \log|G_0| + n_1 \log|C_{11}| + \cdots + n_L \log|C_{LL}|
\]

where t is the order of G_0, which is the genetic covariance
matrix for genetic values of t traits of an animal; q is the
number of animals in A, which is the numerator relationship
matrix; C_11, ..., C_LL are the covariance matrices for the L
random effects that are correlated across traits but
uncorrelated across animals, with n_i the number of sets of
each C_ii.

The two computing intensive terms are calculated from
the Choleski factorization of C: log |C| is obtained as
2 Σ log l_ii, where l_ii is the ith diagonal element of the
Choleski factor. The Choleski factor can be used to solve
for s so that y'Py is calculated as
Σ y_i'R_i^{-1}y_i - s'r, where the first term is
calculated animal by animal.
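
A dense, small-scale sketch of these two calculations is given below. It is an illustration only: SPARSPAK works on sparse storage with a symbolic reordering, whereas the helper names and the 2 x 2 matrix here are hypothetical and chosen so the results can be checked by hand.

```python
import math

def choleski(c):
    """Lower-triangular factor L with C = L L' (dense, no pivoting)."""
    n = len(c)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):
        d = c[j][j] - sum(l[j][k] ** 2 for k in range(j))
        l[j][j] = math.sqrt(d)
        for i in range(j + 1, n):
            l[i][j] = (c[i][j] - sum(l[i][k] * l[j][k] for k in range(j))) / l[j][j]
    return l

def log_det_from_factor(l):
    """log|C| = 2 * sum of logs of the diagonal elements of L."""
    return 2.0 * sum(math.log(l[i][i]) for i in range(len(l)))

def solve_with_factor(l, r):
    """Solve C s = r by forward (L z = r) and backward (L' s = z) substitution."""
    n = len(l)
    z = [0.0] * n
    for i in range(n):
        z[i] = (r[i] - sum(l[i][k] * z[k] for k in range(i))) / l[i][i]
    s = [0.0] * n
    for i in reversed(range(n)):
        s[i] = (z[i] - sum(l[k][i] * s[k] for k in range(i + 1, n))) / l[i][i]
    return s

# Hypothetical coefficient matrix and right-hand side.
c = [[4.0, 2.0], [2.0, 3.0]]
r = [2.0, 1.0]
l = choleski(c)
log_det_c = log_det_from_factor(l)            # log(8), since det(C) = 8
s = solve_with_factor(l, r)                   # [0.5, 0.0]
s_r = sum(si * ri for si, ri in zip(s, r))    # s'r = 1.0
```

Subtracting `s_r` from the animal-by-animal sum Σ y_i'R_i^{-1}y_i then gives y'Py without ever forming the inverse of C.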

The basic steps with sparse matrix techniques are:

1) Symbolically reorder elements of C (once)

2) For each likelihood calculation

a) update G and R via simplex and calculate
log |G| and log |R|,

b) update C, r, and Σ y_i'R_i^{-1}y_i for updated G,
R, and original y,

c) calculate log |C| and s'r as described above,

d) check for convergence (based on change in
-2A).

Times required for these steps were 98 sec to reorder,
44.60 sec to factor, and 1.32 sec to solve (time for a
likelihood calculation = 44.60 + 1.32 = 44.92 sec) for a
single trait model with direct and maternal genetic effects
and maternal permanent effects involving 3,111 animals and
7,303 equations. A traditional derivative method would
require inversion of C with order 7,303 for each iteration.

A three-trait example (Lucia Albuquerque, personal
communication, 1994) introduces problems encountered with
multiple trait analyses. The records were milk, fat and
protein yields for New York Holsteins with measurements
on up to three lactations per cow. The model included
animal genetic (9,722) and animal permanent environmental
effects (animals with records = 5,706) and management
levels (1,509) associated with herd-year-season at initiation
of each lactation. The table gives number of equations and
computing times for one, two, and three trait analyses:


                     Milk      M,F     M,F,P
Equations (no.)    16,937   33,874    50,811
Re-order (sec)         18       61       129
Likelihood (sec)       26      179       594

The advantage of sparsity is illustrated by a similar sample
from California including 10,438 animals, 5,877 cows with
records and only 225 H-Y-S of freshening. The smaller
number of H-Y-S levels resulted in less reorder time and
especially less time to calculate -2A: 14 sec to reorder for
one trait, 103 sec to reorder for three traits, and 11 and 143
sec for each likelihood, compared to times of 26 and 594 sec
for the New York data. The increased time for each
calculation of -2A combined with many restarts shows that
convergence takes a long time with even three traits.


                 California           New York
Number          Milk    M,F,P       Milk    M,F,P
Restarts           1       10          1       10
A/Restart         88      400         95      410
Total A           88     4000         95     4100
Total time     16.4m     6.6d      41.5m    28.2d

The single trait analyses needed well under an hour to
reach global convergence, but the three-trait analyses took
about a week for the California sample and about a month
for the New York sample due to the time per likelihood
calculation and the number of restarts that was needed. The
increase in time for calculation of A for the New York
sample is due to the increase in number of levels of H-Y-S.
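
The totals in the table can be checked by simple arithmetic. The sketch below assumes that total time is the one-time reorder plus the number of likelihood evaluations times the time per evaluation; that decomposition is an assumption, not something stated in the text, but it reproduces the reported totals to the printed precision. The dictionary layout and function name are hypothetical.

```python
SECONDS_PER_MINUTE = 60.0
SECONDS_PER_DAY = 86400.0

# (reorder sec, likelihood evaluations, sec per likelihood), from the text
runs = {
    ("California", "Milk"): (14, 88, 11),
    ("California", "M,F,P"): (103, 4000, 143),
    ("New York", "Milk"): (18, 95, 26),
    ("New York", "M,F,P"): (129, 4100, 594),
}

def total_seconds(reorder, n_likelihoods, sec_per_likelihood):
    """One-time reorder plus all likelihood evaluations."""
    return reorder + n_likelihoods * sec_per_likelihood

ca_milk_min = total_seconds(*runs[("California", "Milk")]) / SECONDS_PER_MINUTE
ca_mfp_days = total_seconds(*runs[("California", "M,F,P")]) / SECONDS_PER_DAY
ny_milk_min = total_seconds(*runs[("New York", "Milk")]) / SECONDS_PER_MINUTE
ny_mfp_days = total_seconds(*runs[("New York", "M,F,P")]) / SECONDS_PER_DAY
```

The three-trait totals are dominated entirely by the product of evaluations and time per evaluation; the reorder time is negligible by comparison.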

Starting values for multiple trait analyses are important
with DFREML, as illustrated by two analyses with different
pairs of traits for the same animals. The first two traits were
animal birth weight when born 1) to a young mother and 2)
to an older mother. The model included direct and maternal
genetic values (with covariance) and maternal permanent
environmental effects, as some older mothers had more than
one calf. Thus, the total number of (co)variances was 15;
3212 animals contributed to relationships, and 765 and 1306
calves were born to young and older mothers, resulting in
14,676 mixed model equations. Starting values for variances
were based on single trait analyses except that a major input
error went unnoticed for one maternal genetic variance. A
total of 22 restarts (restarts were after 150 simplex rounds or
variance of the simplex less than 1.E-6*) was needed before
-2A changed less than .01 from restart to restart. The pattern
of -2A after each restart was 11500 plus, in turn: 34.23,
29.79, 29.67, 29.28*, 26.82, 25.03, 24.55, 24.43*, 24.15*,
23.87*, 23.56*, 20.99, 18.54, 18.07, 17.97, 17.91, 17.85,
17.72*, 17.42, 17.06, 17.02 and 17.01*, when global
convergence was assumed. Several times the system seemed
on the verge of convergence but would then continue to a
better set of estimates.

The similar analyses were with calving ease substituted
for birth weight. Calving ease is a trait that is categorically
measured, which often results in slow convergence. This
time, the starting values were correctly input and only 6
restarts resulted in convergence with consecutive -2A of
1007.12, 994.99, 992.04, 987.62, 986.35 and 986.35*. These
analyses illustrate some of the frustrations with DFREML
for multiple trait analyses and serve to introduce the "art" of
DFREML.


The "ART" of DFREML
Convergence 

The question of how to proceed most efficiently to
find solutions that are globally maximum causes many
headaches, results in some degree of doubt about the
reliability of DFREML, and is still basically an art form with
few established rules. The simplex algorithm is not
guaranteed to reach a global minimum (in this case for -2A).
It may lead to a local minimum. Usually the stopping point
after a start is based on the variance of the n + 1 log
likelihood values retained in the simplex, where n is the
number of parameters. Common stopping points are when
V(-2A) is less than a predetermined value such as 1.E-4,
1.E-6, or 1.E-8. An alternative, based on experience, is to
restart after a certain number of simplex rounds or when
V(-2A) is less than the predetermined constant. (Each
simplex round requires on average about two likelihood
evaluations.) Then -2A is examined for improvement from
the previous start. If the improvement in -2A is less than .01
to .05, then another restart usually results in little additional
improvement. Another alternative is based on the previous
one but includes an examination of variances as fractions of
total variance as well as of correlations. If such proportions
do not change in the second decimal, global convergence is
likely. Nevertheless, experience as well as such ad hoc
guidelines are needed until precise rules are developed. For
example, should restarts be limited to a specific number of
simplex updates, should restarts be terminated after the
variance of simplex has fallen below a pre-determined value,
or should some combination be used? What would be the
best choices for number of simplex rounds and variances?
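
The restart rule described above can be sketched as a small driver loop. This is a hedged illustration, not the MTDFREML implementation: `run_simplex` is a hypothetical callable that performs one simplex search from given parameters and returns the final parameters and -2A, and the stub used here merely replays the calving ease sequence reported earlier.

```python
def restart_until_stable(run_simplex, start, tol=0.01, max_restarts=50):
    """Restart the search until -2A improves by less than tol.

    run_simplex(params) -> (params, minus_two_log_l) is a hypothetical
    single simplex run; tol follows the .01 guideline in the text.
    """
    params, best = run_simplex(start)
    history = [best]
    for _ in range(max_restarts):
        params, value = run_simplex(params)
        history.append(value)
        improvement = best - value
        best = min(best, value)
        if improvement < tol:
            break
    return params, best, history

def replay(values):
    """Stub optimizer that returns pre-recorded -2A values in order."""
    iterator = iter(values)
    def run(params):
        return params, next(iterator)
    return run

calving_ease = [1007.12, 994.99, 992.04, 987.62, 986.35, 986.35]
_, best, history = restart_until_stable(replay(calving_ease), start=None)
```

With the replayed sequence the loop stops at the sixth value, when -2A no longer changes. A fuller rule could also compare variance fractions and correlations between restarts, as suggested in the text.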

The  following  table  shows  -2A  at  three  convergence 
levels  for  10  samples  of  milk  records  with  first,  second,  and 
third  lactations  being  considered  separate  traits  (Lucia 
Albuquerque,  personal  communication,  1994). 

-2A FOR THREE CONVERGENCE CRITERIA (10 samples)

              Convergence Criterion
Sample      1.E-4       1.E-6       1.E-9
   1      58601.32    58601.20    58601.07
   2      55087.71    55087.71    55087.69
   3      57122.57    57122.49    57122.49
   4      53185.73    53185.71    53185.65
   5      52942.48    52942.43    52942.14
   6      51778.50    51778.47    51778.46
   7      53446.38    53443.84    53443.84
   8      50851.60    50851.52    50851.43
   9      53778.04    53778.00    53777.97
  10      55685.27    55685.24    55685.06

The table illustrates the art of deciding whether global
convergence has been reached. For some samples, 1.E-4 and
1.E-6 led to similar -2A, with 1.E-6 always reaching a
value at least as small (as good). In other cases, continuing
to 1.E-9 resulted in improvement. The importance of
differences in -2A at the second decimal is difficult to quantify.

Proportional estimates of the variances and correlations
for the averages of the same 10 samples at convergence of
1.E-6 and 1.E-9 after many restarts are shown below.

AVERAGES FOR TWO CONVERGENCE CRITERIA

                              Convergence Criterion
Lactations                    1.E-6         1.E-9

HERITABILITIES
  1                            .35           .35
  2                            .34           .34
  3                            .33           .32

GENETIC CORRELATIONS
  (1 x 2)                      .87           .87
  (1 x 3)                      .81           .81
  (2 x 3)                      .97           .97

ENVIRONMENTAL CORRELATIONS
  (1 x 2)                      .43           .43
  (1 x 3)                      .38           .38
  (2 x 3)                      .44 (.444)    .45 (.445)

PHENOTYPIC CORRELATIONS
  (1 x 2)                      .58           .58
  (1 x 3)                      .53           .53
  (2 x 3)                      .62           .62

To two decimals the averages of proportions were
essentially the same. For animal breeding applications even
changes in fractional variances from, for example, .30 to .35
are not often important.

Experience has been that 1) for single trait analyses
with no imbedded covariances such as the direct-maternal
genetic covariance, global convergence is usually reached
when V(-2A) is less than 1.E-6, although one restart is a
safety measure, 2) for a single trait analysis with a
direct-maternal covariance at least one restart is needed, and 3) for
multiple trait analyses many restarts will be needed with the
number dependent on starting values, the complexity of the
model, and even the scale of measurements. The multiple
trait "rule" is restart, restart, ..., until -2A does not change
more than about .01.

Boundary  Conditions 

As with any REML algorithm, solutions outside the
parameter space are not estimates. For example, variances
must be greater than zero and absolute values of genetic and
other correlations must not exceed unity. In addition,
eigenvalues of matrices such as R_0 and G_0, which represent
environmental and genetic covariance matrices for traits
measured on an animal, must be positive. As part of the
simplex algorithm, whenever an update of a solution is not
allowed, a large value is assigned to -2A, which forces a
contraction of the simplex update. If necessary, other
contractions are forced until the update is allowed. Such
contractions are done before the expensive calculation of
log |C| and y'Py so that little time is wasted. Solutions
near boundaries, however, often indicate many rounds will
be needed as solutions may creep to the boundary of allowed
estimates.
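
The cheap check that precedes the expensive likelihood evaluation can be sketched as follows. This is an illustration under stated assumptions: the penalty constant `BIG`, the function names, and the `expensive_eval` callable are hypothetical, and positive definiteness is tested by attempting a Choleski factorization instead of computing eigenvalues (the two tests agree for symmetric matrices).

```python
import math

BIG = 1.0e37  # hypothetical "large value" returned for disallowed updates

def is_positive_definite(m):
    """True if the symmetric matrix m has a Choleski factor with positive diagonal."""
    n = len(m)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):
        d = m[j][j] - sum(l[j][k] ** 2 for k in range(j))
        if d <= 0.0:
            return False
        l[j][j] = math.sqrt(d)
        for i in range(j + 1, n):
            l[i][j] = (m[i][j] - sum(l[i][k] * l[j][k] for k in range(j))) / l[j][j]
    return True

def guarded_minus_two_log_l(g0, r0, expensive_eval):
    """Return BIG without calling expensive_eval when g0 or r0 is not allowed."""
    if not (is_positive_definite(g0) and is_positive_definite(r0)):
        return BIG
    return expensive_eval(g0, r0)

calls = []
def fake_eval(g0, r0):
    calls.append(1)          # stands in for forming and factoring C
    return 1234.5

bad = guarded_minus_two_log_l([[1.0, 2.0], [2.0, 1.0]], [[1.0]], fake_eval)
good = guarded_minus_two_log_l([[2.0, 1.0], [1.0, 2.0]], [[1.0]], fake_eval)
```

A positive definite covariance matrix automatically has positive variances and correlations inside (-1, 1), so the single test covers the conditions listed above.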

Sign  of  Correlations 

The simplex operates by updating current solutions by
increasingly smaller fractions of the current solutions. If a
starting correlation (covariance) is positive and the optimum
solution is negative, the search must pass from positive to
negative values of the covariance. Experience indicates that
the cross-over requires many rounds of likelihood
evaluations.

Rounding in Calculation of -2A

A problem that occasionally occurs is that
convergence is never attained, i.e., the variance of -2A in
the simplex will never be less than 1.E-6. In such cases,
that variance typically bounces around at values larger than
1.E-6. Rounding error in calculation of -2A from the four
components of A is likely the reason. The amount of
rounding error will be computer and possibly compiler
dependent. The potential for rounding error is illustrated by
-2A values in the range of 190,000 for which V(-2A) is to be
less than 1.E-6 or 1.E-8 at convergence.
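
The scale of the problem can be illustrated with the spacing of floating-point numbers near 190,000. The sketch assumes IEEE 754 single and double precision; the text does not say which precision was used, and accumulated rounding in the four components would be larger than this representational floor. The helper names are hypothetical.

```python
import struct

def spacing(x, float_fmt, int_fmt):
    """Gap between x and the next larger representable value in the given format."""
    bits = struct.unpack(int_fmt, struct.pack(float_fmt, x))[0]
    nxt = struct.unpack(float_fmt, struct.pack(int_fmt, bits + 1))[0]
    cur = struct.unpack(float_fmt, struct.pack(float_fmt, x))[0]
    return nxt - cur

def variance_one_step(d, n_values):
    """Sample variance of n_values numbers that are equal except one larger by d."""
    return d * d / n_values

single = spacing(190000.0, "<f", "<I")   # 2**-6  = 0.015625
double = spacing(190000.0, "<d", "<Q")   # 2**-35, about 2.9e-11

# 15 parameters give 16 values in the simplex (as in the birth weight example).
v_single = variance_one_step(single, 16)
v_double = variance_one_step(double, 16)
```

Under these assumptions, a single-precision -2A near 190,000 cannot yield a nonzero simplex variance of this form below 1.E-6, whereas double precision leaves ample room.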

Another experience also may be due to rounding error.
At least one analysis has shown a cyclic fluctuation in -2A
such as 24470, 24490, 24470, ..., which ensures that V(-2A)
is large and convergence based on V(-2A) will not be
attained. Examination of the solutions showed only slight
differences even though the -2A for each set of solutions
was quite different. A possible explanation is that parts of
the likelihood involve logs of small eigenvalues which are
then multiplied by a number such as the number of animals.

Binomial data with values of 0 and 1 or 1 and 2 have
led to the problem described in the previous paragraph.
When convergence has not been attained, two approaches
have been followed. Multiplying the binomial values by 100
sometimes seems to lead to better numerical properties.
Another alternative has been to change to a sire model rather
than to continue with an animal model.

Starting  Values 

The importance of starting values depends primarily on
whether the analysis contains covariance terms. For a single
trait analysis, the sum of components at the start should be
reasonable, i.e., less than the raw variance. At least one
analysis failed to reach convergence when, by oversight, the
starting variances were all several times the true variances.
With direct-maternal genetic covariance included for a single
trait, choice of correct sign of the covariance is important as
discussed earlier. The covariance should not be started as
zero because the steps of the simplex algorithm are
proportions of the previous solutions. A starting zero will
remain zero.
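
Why a starting zero persists can be seen in a small sketch. It assumes, hypothetically, that the initial simplex is built by scaling each starting value by a fixed proportion, which is consistent with the description above but is not taken from the MTDFREML source. Every later simplex move (reflection, expansion, contraction) is an affine combination of existing vertices, so a coordinate that is zero in all vertices stays zero.

```python
def initial_simplex(start, fraction=0.2):
    """Hypothetical proportional construction: vertex k scales parameter k."""
    vertices = [list(start)]
    for k in range(len(start)):
        vertex = list(start)
        vertex[k] = vertex[k] * (1.0 + fraction)
        vertices.append(vertex)
    return vertices

def reflect(vertices, worst):
    """Reflect the worst vertex through the centroid of the others."""
    others = [v for i, v in enumerate(vertices) if i != worst]
    centroid = [sum(col) / len(others) for col in zip(*others)]
    return [2.0 * c - w for c, w in zip(centroid, vertices[worst])]

start = [10.0, 0.0, 5.0]      # the second parameter (a covariance) starts at zero
simplex = initial_simplex(start)
reflected = reflect(simplex, worst=0)
```

A small nonzero starting covariance of the intended sign avoids this trap.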

Multiple trait analyses take more time per round, more
rounds to simplex convergence and, usually, many restarts
to attain global convergence; thus good starting values are
important. One suggestion is: 1) do single trait analyses to
determine variances and within-trait direct maternal
covariances, 2) start with across-trait covariances
corresponding to moderate correlations and the better guess
of positive or negative sign while holding variances from 1)
constant (an option in the MTDFREML program); and then
3) let all (co)variance elements vary in the simplex with the
prospect of several restarts.

Conclusions 

Derivative-free REML with sparse matrix methods
based on Henderson's mixed model equations has expanded
the magnitude of single and multiple trait analyses to obtain
REML estimates of variances and covariances. Single trait
analyses converge quickly. The "art" of DFREML mainly
involves rules for reducing time to global convergence for
multiple trait analyses. Optimum starting values and restart
strategies have not been determined, although obvious ad
hoc rules have been evolving. Restarts to ensure convergence
to a global maximum for A (or minimum for -2A) are
mandatory for multiple trait analyses. Help is needed 1) to
develop an improved updating algorithm, 2) to determine
starting strategies for multiple trait analyses, and 3) to design
a general method for restarting to obtain most efficiently
solutions that have converged to the global maximum of the
likelihood given the data.

Abstract: For generating response surface
designs, most general purpose ("D-optimal")
algorithms work point by point in the design
domain. We introduce a class of algorithms
operating in the dual, factor/column space. Their
basic operations exchange, randomly and
systematically, the rows of certain columns (factors)
with respect to the rows of other columns.

This dual space approach is especially suitable
for designing computer experiments of Latin
hypercube type. The experimenter can embed two- and
three-level response surface designs, both to match
a calibration subset and to achieve high efficiency.
More centrally, the experimenter explicitly chooses
the number of factor levels and their frequencies,
ideal both for considering model-free goodness-of-fit
and for establishing interpolation grids.

1. Computer Experiments

Complicated physical phenomena are
increasingly well modeled by computer simulators.
The underlying physical theory usually involves
two-to-four dimensional differential equations with
boundary conditions; key complications consist of
multiple materials, their interfaces and geometrical
structures. Important methods encompass
automatic grid generators, parallel computing
algorithms, finite element analysis, and empirical
metrology and calibration procedures. Typical
simulator applications include verification and
optimization of product designs and policies, diagnosis
of problems and opportunities, evaluation of
difficult-to-measure constructs, development of
predictive models, and estimates of distributions.

Experiments using computer simulators,
so-called computer experiments, are a focus of
statistical methods research. Among their special
considerations are their ability to repeat perfectly,
the increased feasibility to run larger experiments,
and the opportunity to fit richer, more
nonparametric models. Modeling approaches
include kriging (Matheron [1971], Sacks et al. [1989]),
nonparametric regression (Friedman [1991]), and
neural networks (Cheng and Titterington [1994]).
This work is shaped by target applications: Sacks
et al. (1989) predict, then optimize analog
integrated circuit performance. Friedman (1991)
emphasizes graphical visualization and
decomposition. Several authors perturb simulator inputs
to project output distributions. Their methods
range from Monte Carlo sampling (Kibarian and
Strojwas [1991]) and low-order moment estimation
(Zaino and D'Errico [1988]) to Latin hypercube
sampling (McKay et al. [1979]).

This  paper  is  on  designing  computer 
experiments,  in  particular  using  Latin  hypercubes 
(LHCs). When introduced by McKay et al. (1979),
LHCs  were  constructed  by  random  mechanisms, 
and  have  since  been  shown  to  be  more  efficient  for 
distribution  estimation  than  Monte  Carlo  sampling 
(Stein  [1987],  Owen  [1992a]).  LHCs’  advantage  is 
further  increased  by  constructing  them  using 
orthogonal  arrays  (Owen  [1992b],  Tang  [1993]). 

An  alternative  computer  experiment  design 
approach  is  that  of  Sacks  et  al  (1989),  who 
introduce  a  class  of  optimal  experiments  based  on  a 
kriging  model,  in  the  sense  of  minimizing 
integrated  mean  square  error.  Figure  1  shows  the 
scatterplot  matrix  of  their  32-run  6-factor  design. 
Note  that  each  projection  into  two  dimensions 
shows  a  characteristic  five-spot  X-pattern.  Many 
researchers  have  found  this  pattern  objectionable, 
preferring  the  symmetry  of  Latin  hypercubes. 

2. Problem Statement

Like  many  others,  our  ultimate  application  is 
distribution  estimation.  The  economics  of 
simulation  motivate  our  approach.  Simulations  are 
relatively  slow,  on  the  order  of  12-24  hours  each. 
This  encourages  us  to  build  an  intermediate  model, 
one  from  which  we  can  interpolate  other  values. 
Also,  at  certain  points  in  the  design  domain  we 
have  empirical  measurements,  whose  configuration 
forms  a  conventional  2^'^3^  response  surface 
design.  To  these  we  need  to  match  their 
corresponding  simulations,  in  order  to  calibrate  the 
model  correctly.  The  empirical  measurements  are 
also  precious,  and  the  time  it  takes  to  develop  them 
ultimately  bounds  the  number  of  simulations  we 
can  perform. 

To summarize, the particular characteristics of
our computer experiment are the following: (1)
Computer simulations are time-consuming, hence
precious, and efficient designs are therefore
desirable. (2) The computer experiment is used to
establish an interpolation grid, from which an
easy-to-evaluate model can be developed. (Observe that
the application of Latin hypercube sampling, by
which a simulator is evaluated in order to estimate
the distribution of an output parameter, is moved
outside our scope. We can, of course, apply Latin
hypercube sampling to our interpolated model.)
(3) Some of the computer simulations are fixed a
priori (to match empirical measurements for
calibration, perhaps to improve the interpolation
grid), and we would like to take advantage of these
runs. (4) The size of the experiment is small to
moderate; for definiteness, say 50-100 runs of
about 8-10 factors. (5) Beyond design optimality,
we would like to preserve some sensitivity to
detecting model lack of fit.

3. Optimal Experiments and Design Repair

Much  of  the  theory  of  optimal  designs  is  based 
on  conventional  linear  models,  with  homogeneous, 
independent  errors.  This  literature  has  two  themes. 
By  one  theme,  with  respect  to  a  particular  model, 
one  defines  criteria  by  which  one  can  compare 
designs.  These  are  usually  functions  of  the 
coefficients’  variance-covariance  matrix;  the  most 
common  is  the  determinant,  the  so-called  D- 
optimality  criterion.  By  theme  two,  the  design 
domain,  in  principle  continuous,  is  reduced  to  a 
finite  set.  For  example,  with  one  factor,  the 
optimal  design  of  a  linear  model  is  well  known  to 
concentrate  all  points  at  the  extremes  of  the 
feasible  range.  Similarly,  optimal  designs  for 
quadratic  models  concentrate  all  design  points  at 
three  levels.  Atkinson  and  Donev  (1993)  give  a 
contemporary  account  of  optimal  design  literature. 
In  practice,  for  computer  experiments,  the  limited 
variety  of  points  in  the  design  domain  has  made 
conventional  optimal  designs  unattractive. 

Our basic approach adapts the columnwise
D-optimal algorithm of Heavlin and Finnegan (1993).
The "design repair" algorithm presented therein
uses the D-optimal criterion (theme one, above),
but not the restricted design domain (theme two).
Instead, the experimenter chooses each factor's
levels, and the frequency with which they are used.
For computer experiments of the Latin hypercube
type, with n runs, this means the levels of each
factor are n distinct values, each used once; equal
spacing is used to improve the interpolation grid.

The  design  repair  approach  also  uses 
conventional  linear  models,  and  homogeneous, 
independent  errors.  Its  natural  domain  of 
applicability  is  sequential  batch  processes,  e.g. 
semiconductor  manufacturing.  Applications  include 
assigning  interacting  covariates,  adapting  experi¬ 
ments  to  lost  experimental  units  (e.g.  broken 
silicon  wafers),  designing  responses  with  partially 
overlapping  factor  sets,  and  allocating  noise-factor 
batch  positions.  For  conventional  response  surface 
designs,  design  repair  has  proven  useful  for  finding 
partially  balanced  incomplete  block  designs, 
combining  mixture  and  nonmixture  factors,  cre¬ 
ating  level-balanced  response  surface  designs,  and 
constructing  loss  resistant  experimental  designs. 

Design repair's primary data structures are two
partial design matrices, W and X, and one model.
Both W and X have n rows, and p_W and p_X
columns respectively. In addition, for certain
problems we wish to include certain experiments,
certain complete rows. We denote these rows by
WX_0. Let w_i (x_i) denote the ith row of W (X).
Let π denote a permutation of the row indices
(1, 2, ..., n).

We would like to form the design matrix WX_π
whose ith row is w_i and x_{π(i)}, and which
includes the a priori rows WX_0, that is,

\[
WX_\pi = \begin{pmatrix} W & X_\pi \\ \multicolumn{2}{c}{WX_0} \end{pmatrix},
\]

where X_π denotes the n x p_X matrix whose ith row
is x_{π(i)}. From WX_π we can develop a model
matrix M_π whose ith row is m(w_i, x_{π(i)}). For
example, were W and X both one column matrices,
and our desired model a full quadratic, then
m(w_i, x_{π(i)}) returns the row vector

\[
\left(1,\; w_i,\; x_{\pi(i)},\; w_i x_{\pi(i)},\; w_i^2,\; x_{\pi(i)}^2\right),
\]

corresponding to the constant, two linear,
one interaction, and two quadratic terms. Hence,
M_π has the form

\[
M_\pi = \left(\begin{array}{cc|c} W & X_\pi & \text{higher} \\ \multicolumn{2}{c|}{WX_0} & \text{order terms} \end{array}\right).
\]

The design repair algorithm works to find the
best π, or at least a good one, so that we can
estimate by least squares the linear coefficients of
M_π. We choose the D-optimality criterion, for
which larger values are better:

\[
D(M) = \begin{cases} \ln\left(\det(M'M)\right), & \text{for } M'M \text{ non-singular}, \\ -\infty, & \text{otherwise}. \end{cases}
\]

With a quantitative measure (D) of a good
design specified, the design repair algorithm is
easily imagined. It comes in two parts, a random
starting point, called R-step, and a deterministic
search over exchanges or transpositions of pairs of
elements in π, called E-step.

R-step: π is selected at random from all possible
permutations, WX_π and M_π constructed, and
D(M_π) evaluated. This is repeated for n_R
iterations, keeping track of the best π. As the number
of iterations increases, discovery of better
permutations becomes less likely and R-step becomes
inefficient. This motivates E-step.

E-step: As a starting point, E-step uses the best
permutation found from R-step, say π_R. All n(n-1)/2
combinations formed by exchanging a pair of
indices are then considered, and that with the largest
D-value is selected. In this way, E-step is repeated
until no further improvement from pairwise
exchanges of indices (rows of X_π) is found.
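
The two steps can be sketched for the one-column, full-quadratic example given earlier. This is a minimal illustration under stated assumptions, not the authors' program: the function names, the nine-level toy design, the 20 R-step iterations, and the numerical tolerances are all hypothetical, and no a priori rows WX_0 are included.

```python
import math
import random

def log_det_spd(a):
    """ln(det(a)) by Choleski; -inf when a is (numerically) singular."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    total = 0.0
    for j in range(n):
        d = a[j][j] - sum(l[j][k] ** 2 for k in range(j))
        if d <= 1e-10:
            return -math.inf
        l[j][j] = math.sqrt(d)
        total += math.log(d)
        for i in range(j + 1, n):
            l[i][j] = (a[i][j] - sum(l[i][k] * l[j][k] for k in range(j))) / l[j][j]
    return total

def full_quadratic(w, x):
    """Model row: constant, two linear, one interaction, two quadratic terms."""
    return [1.0, w, x, w * x, w * w, x * x]

def d_value(w_col, x_col, perm):
    """D(M) = ln det(M'M) for the design pairing w_i with x_perm[i]."""
    rows = [full_quadratic(w_col[i], x_col[perm[i]]) for i in range(len(w_col))]
    p = len(rows[0])
    mtm = [[sum(r[a] * r[b] for r in rows) for b in range(p)] for a in range(p)]
    return log_det_spd(mtm)

def r_step(w_col, x_col, n_iter, rng):
    """Keep the best of n_iter random permutations."""
    best_perm, best_d = None, -math.inf
    for _ in range(n_iter):
        perm = list(range(len(w_col)))
        rng.shuffle(perm)
        d = d_value(w_col, x_col, perm)
        if best_perm is None or d > best_d:
            best_perm, best_d = perm, d
    return best_perm, best_d

def e_step(w_col, x_col, perm, d):
    """Repeatedly apply the best pairwise exchange until none improves D."""
    perm = list(perm)
    n = len(perm)
    while True:
        best_pair, best_d = None, d
        for i in range(n - 1):
            for j in range(i + 1, n):
                perm[i], perm[j] = perm[j], perm[i]
                cand = d_value(w_col, x_col, perm)
                perm[i], perm[j] = perm[j], perm[i]
                if cand > best_d + 1e-9:
                    best_pair, best_d = (i, j), cand
        if best_pair is None:
            return perm, d
        i, j = best_pair
        perm[i], perm[j] = perm[j], perm[i]
        d = best_d

levels = [-1.0 + 2.0 * i / 8 for i in range(9)]   # nine equally spaced levels
rng = random.Random(1)
perm_r, d_r = r_step(levels, levels, n_iter=20, rng=rng)
perm_e, d_e = e_step(levels, levels, perm_r, d_r)
```

Because every permutation uses each level of X exactly once, the Latin hypercube margins are preserved throughout the search; only the pairing of levels across columns changes.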

Specification of n_R: For the optimum number of
R-steps, n_R, Heavlin and Finnegan (1993) develop
a heuristic and approximate relationship:
log10(n_R) ≈ -0.71 + 2.12 log10(n). In one region
of interest, n about 50, this implies n_R ≈ 800.
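
As a quick check of the arithmetic, the relationship can be evaluated directly; the function name is hypothetical.

```python
import math

def r_step_iterations(n):
    """Approximate optimum number of R-steps for n runs, per the relationship above."""
    return 10.0 ** (-0.71 + 2.12 * math.log10(n))

n_r_50 = r_step_iterations(50)
n_r_100 = r_step_iterations(100)
```

For n = 50 this evaluates to a little under 800, in line with the rounded figure in the text, and the exponent of 2.12 means the count slightly more than quadruples when n doubles.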

4. Computer Experiment Test Case

To use the design algorithm, one must specify
the matrices W and X, and an appropriate model.
For computer experiments, one usually needs to
apply the design repair algorithm several times in
series, building up the columns of the design in
stages. Let W^j, X^j, and m^j denote these objects for
the jth application of design repair. Denote by
DR(W^j, X^j, m^j) the solution from the jth step.
Three issues need addressing:

1. Path: W^1 = (1, 2, ..., n)' seems natural,
as does W^{j+1} = DR(W^j, X^j, m^j). How should X^j be
selected for j ≥ 2? The fastest route is X^j = W^j,
which we call "doubling." This allows us to start
initially with a column matrix, then obtain a
two-column design, then four, then 8, and so on. The
alternative is to choose X^j = X^1 for all j. This
builds up the design slowly, one column at a time.
This path we call "add one."

2. Bases: Should the model be described as a
polynomial, or are there useful alternatives, such as
using terms of a Fourier series?

3. Models: What model, in particular which
interactions, should be specified? At one extreme,
one can specify a purely additive model, with no
interactions among the factors; at the other
extreme, one might pose as large a set of
interactions as feasible.

As a test case, we develop an 8-factor, 51-run
Latin hypercube. There are several reasons for this
choice. The size of this experiment is large enough
to be practical, yet small enough for design repair
to handle reasonably. 51 runs allow us to specify
the levels (-1, -0.96, -0.92, ..., 0, 0.04, ..., +1).
Finally, Tang (1993) has published scatterplot
matrices for a 49-run 8-factor LHC constructed
using orthogonal arrays, giving us a good standard
for comparison. To facilitate comparisons, we use
no WX_0 matrix.

For this exercise, we follow both the doubling
path and the add-one path. For bases, we use a
7-degree (orthogonalized) polynomial, whose terms
before orthogonalization correspond to w, w^2, ...,
w^7, and these seven columns comprise
W^1. As an alternative basis, we also consider seven
terms of a Fourier series, corresponding to w,
sin(2πw), cos(2πw), sin(4πw), cos(4πw), sin(8πw),
cos(8πw), also orthogonalized. To enhance
comparability,  for  these  four  designs,  we  choose 
additive  models,  with  no  interactions;  is  the 
same  in  all  constructions,  the  attractive  result  of  a 
design  repair  construction  using  the  seven-term 
Fourier  basis  and  a  high-order  interaction  model. 
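
A minimal sketch of how the two candidate bases might be built on the 51 levels. It assumes a plain modified Gram-Schmidt orthogonalization, since the text does not say which orthogonalization is used; the helper names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthogonalize(columns):
    """Modified Gram-Schmidt: orthonormal columns spanning the same space."""
    basis = []
    for col in columns:
        v = list(col)
        for q in basis:
            c = dot(q, v)
            v = [vi - c * qi for vi, qi in zip(v, q)]
        norm = math.sqrt(dot(v, v))
        basis.append([vi / norm for vi in v])
    return basis

# The 51 levels -1, -0.96, ..., +1.
w = [-1.0 + 0.04 * i for i in range(51)]

# Polynomial basis: w, w^2, ..., w^7 before orthogonalization.
poly_raw = [[x ** d for x in w] for d in range(1, 8)]

# Fourier basis: w, then sine/cosine pairs at the frequencies named in the text.
fourier_raw = [w[:]]
for m in (2, 4, 8):
    fourier_raw.append([math.sin(m * math.pi * x) for x in w])
    fourier_raw.append([math.cos(m * math.pi * x) for x in w])

poly_basis = orthogonalize(poly_raw)
fourier_basis = orthogonalize(fourier_raw)
```

Both results have seven mutually orthogonal columns of length 51, so the two bases can be swapped in the same construction without changing anything else.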

Judging  from  scatterplot  matrices,  the  most 
satisfying  design  is  that  using  the  polynomial  basis 
and  add-one  path  (figure  2),  comparable  to  Tang’s 
figure  3  of  an  orthogonal  array-based  LHC.  Space 
limitations  prohibit  showing  scatterplot  matrices  of 
the  other  bases  and  paths,  but  the  scatterplot 
matrices  of  both  add-one  constructions  are  more 
satisfying,  with  points  well  spread  out  and  no  large 
area  unoccupied,  than  those  from  the  doubling 
path.  (This  agrees  with  the  authors’  experiences  in 
other  computer  experiment  applications.)  In  both 
cases,  the  polynomial  constructions  are  somewhat 
more  pleasing  than  those  using  the  Fourier  series. 

Figure  3  is  the  scatterplot  matrix  of  the  51-run 
7-factor  design  repair  construction.  Like  that  in 
figure  2,  it  is  the  result  of  the  add-one  path  and  the 
7-term  polynomial  basis.  Unlike  figure  2,  it  uses  a 
series of rich models: a full sixth-order model

(81  terms);  a  full  fifth-order  model  (124 

terms); a full third-order model, plus all

fourth-order  terms  involving  the  fifth  factor,  plus 
pure  quartic  terms  for  all  five  factors  (127  terms); 

a  full  cubic  model,  plus  all  pure  fourth-order 
terms,  plus  all  mixed  interactions  involving  the 
sixth  factor  (99  terms);  and  a  full  cubic  model, 
plus  pure  quartic  terms  (77  terms).  These  models 
have more terms than runs; the D-criterion is
modified to $\ln(\det(M'M + \lambda I))$, with $\lambda = 0.1$.
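
When a model has more terms than runs, the information matrix is singular and its determinant is zero, which is why a regularized criterion is needed. The sketch below assumes a ridge-type form, ln det(M'M + λI), with M taken to be the runs-by-terms model matrix; the notation and the helper name are assumptions, not the authors' implementation.

```python
import math

def regularized_log_det(model_matrix, lam=0.1):
    """ln det(M'M + lam*I), computed through a Cholesky factorization."""
    n, p = len(model_matrix), len(model_matrix[0])
    # Information matrix M'M plus the ridge term lam*I.
    a = [[sum(model_matrix[r][i] * model_matrix[r][j] for r in range(n))
          + (lam if i == j else 0.0) for j in range(p)] for i in range(p)]
    # Cholesky: a = L L', so ln det(a) = 2 * sum(ln L[i][i]).
    L = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            s = a[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return 2.0 * sum(math.log(L[i][i]) for i in range(p))

# Two runs, three model terms: M'M alone has rank at most 2.
M = [[1.0, 1.0, 1.0],
     [1.0, -1.0, 1.0]]
value = regularized_log_det(M, lam=0.1)
```

In this toy case M'M has eigenvalues 4, 2, and 0, so the unmodified determinant vanishes, while the regularized one equals 4.1 × 2.1 × 0.1.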

Under  the  constraints  of  the  LHC  margins, 
figure 3 shows an X-pattern similar to that of
Sacks  et  al  (1989).  One  might  speculate  that  the 
kriging  model  is  related  to  interaction-rich  models. 
An  alternate  interpretation  is  that  the  design  repair 
approach  to  LHC  construction  should  be  applied 
only  for  additive  models. 

5.  Conclusions 

The  design  repair  algorithm  can  construct  Latin 
hypercube  designs  successfully.  Conditions  where 
this  is  appropriate  are  listed  in  section  2;  the  key 
ingredients  are  a  design  of  moderate  scope  with 
some  particular  requirements.  Based  on  reviewing 
scatterplot  matrices  of  the  resulting  designs, 
polynomial  models  work  at  least  as  well  as  the 
alternatives.  The  add-one  path  allows  the  models 
to  be  specialized  to  each  step  of  construction;  for 
this  reason,  it  is  not  unexpected  that  add-one 
designs  have  better  esthetic  properties  than  designs 
based  on  the  doubling  path. 

A  well  recognized  analogy  is  on  one  hand  with 
two-level  factors,  linear  models,  and  resolution  III 
projection  properties  and,  on  the  other  hand,  with 
multilevel  factors,  additive  models,  and  two- 
dimensional  projections  (called  strength  2).  For 
this  reason,  one  might  anticipate  that  additive 
models would give appealing scatterplot matrices,
which  are  merely  graphical  strength  2  assessments. 
The  similarity  of  X-patterns  both  in  Sacks  et  al 
(1989)  and  figure  3’s  interaction-rich  Latin 
hypercube  construction  is  more  tantalizing,  perhaps 
pointing  to  some  connection  between  the  two 
approaches  for  high-dimensional  designs. 

Figure 1. Scatterplot matrix of the 6-factor
computer experiment of Sacks et al (1989). The
optimality criterion is minimum integrated mean
square error; the model is a kriging one.


Figure S. Scatterplot matrix of a 51-run S-factor Latin hypercube using the design repair algorithm.

Abstract

In recent years, modeling of spatial processes on the sphere
(e.g., mining, oil exploration, forestry, pollution, ozone
levels, etc.) has become more common. Throughout, however,
there has been no generally accepted global sampling plan,
and none for which a central limit theorem (CLT) or a
resampling algorithm has been formulated. The global sampling
plans that have been used are derived from either experimental
design methods or geographical methods. In this paper, we
outline each of these types of sampling plans, describing
their strengths and weaknesses, and then describe a global
sampling plan called a stratified spherical sampling plan
(Brown [1993a]), for which a CLT has been proved (Brown
[1993b]), a bootstrap algorithm has been developed, and strong
uniform consistency of the sample mean has been proved
(Brown [1993c]).


Background 

Spherical data arise in many disciplines: astrophysics
(star clusters), health sciences (MRI, contaminants), geology
(oil, earthquakes), meteorology (ozone, pollution), and
geography (water levels, coastlines), to name a few. Sampling
plans play a major role in characterizing a random field and
the dependence structure of statistics defined on the random
field. In particular, creating confidence intervals and
conducting hypothesis tests on statistics are directly related
to the sampling plan.

Unfortunately, there is no generally accepted way to
gather spherical data. In particular, we would like a
global sampling plan upon which we can prove a CLT
and/or create a resampling algorithm. Up until 1993,
the only sampling plan for which a CLT had been proved
was the continuously indexed sphere (Leonenko and
Yadrenko [1979]).

The most common way to prove a CLT for dependent data
is to use a characteristic of the random field known as
stationarity (translation invariance), together with the
big-block, little-block methodology and α-mixing, to reduce
the problem to the iid setting. Using these ideas when the
sample size n is very large, if the small blocks are small
in size compared to the big blocks, but still large enough
to separate the big blocks by a substantial amount, then the
big blocks act almost independently (α-mixing) while the
small blocks are negligible compared with the big blocks.
Stationarity ensures that the statistics defined on the big
blocks are iid. Note that when working with spherical data,
we assume that the random field is isotropic, that is,
rotation and translation invariant.

Resampling algorithms are usually employed when interest
is in a parameter $\theta$ of some distribution $F$, the
estimate of $\theta$ is cumbersome, and the calculations of the
distribution of the estimate are intractable. Usually one
wishes to create confidence intervals for $\theta$ and/or do
hypothesis testing on $\theta$; a resampling algorithm estimates
the true distribution of the statistic, and this estimated
distribution is used in the inference.

In this setting, we collect data $(X_1, X_2, \ldots, X_n) = \mathbf{X}_n$
from $F$, use a statistic $t_n = t_n(\mathbf{X}_n)$ that estimates $\theta$, and
determine the distribution of $t_n$. The field of resampling
tries to estimate the distribution of $t_n$ by reusing the
data at hand to create more samples and hence more
statistics. We investigate the bootstrap here, but there
are many other resampling methods.

In 1979, Efron described a resampling method called
the bootstrap for iid data. This method is paraphrased
as follows: from data $X_1, X_2, \ldots, X_n$, calculate the
empirical distribution function of the data,
$F_n(x) = \frac{1}{n} \sum_{i=1}^{n} I(X_i \le x)$. Resample $n$ observations iid from
$F_n(x)$ to create a bootstrap sample $\mathbf{X}_n^* = (X_1^*, X_2^*, \ldots, X_n^*)$.
Calculate a bootstrap statistic $t_n^* = t_n(\mathbf{X}_n^*)$.
Repeat this procedure $B$ times and use the distribution
of the $t_n^{*b}$, $b = 1, \ldots, B$, as an estimate of the
distribution of $t_n$.

In fact, the true bootstrap estimate of the mean of $t_n$
is $\mu_{Boot} = E_{F_n}(t_n^*)$, which is estimated by

$$\hat{\mu}_{Boot} = \frac{1}{B} \sum_{b=1}^{B} t_n^{*b} = \bar{t}_n^*,$$

and the true bootstrap estimate of the variance of $t_n$ is
$\sigma^2_{Boot} = \mathrm{Var}_{F_n}(t_n^*)$, which is estimated by

$$\hat{\sigma}^2_{Boot} = \frac{1}{B-1} \sum_{b=1}^{B} \left( t_n^{*b} - \bar{t}_n^* \right)^2.$$
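
The iid bootstrap described above can be sketched in a few lines of Python. The function name, the sample data, and the choice of the sample mean as the statistic are illustrative assumptions; the two summary values correspond to the Monte Carlo estimates of the bootstrap mean and variance (divisors B and B - 1).

```python
import random
import statistics

def bootstrap(data, statistic, n_boot=1000, seed=0):
    """Return B bootstrap replicates with their mean and variance."""
    rng = random.Random(seed)
    n = len(data)
    replicates = []
    for _ in range(n_boot):
        # Drawing n values with replacement is an iid sample from F_n.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        replicates.append(statistic(resample))
    mean_boot = statistics.fmean(replicates)    # (1/B) * sum of replicates
    var_boot = statistics.variance(replicates)  # divisor B - 1
    return replicates, mean_boot, var_boot

data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7]
reps, mean_boot, var_boot = bootstrap(data, statistics.fmean)
```

The list `reps` plays the role of the estimated distribution of the statistic; percentile confidence intervals, for example, would be read from its sorted values.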


Carlstein [1986] extended Efron's bootstrap to time-series
data by creating the non-overlapping blockwise
bootstrap. His method creates blocks with identical joint
distributions (due to the stationarity of the time series);
the blocks are then treated as the $X_i$ in the iid setting.
In particular, from the $n$ observations of the time series,
let $k = \lfloor n/l \rfloor$ and let $B_i = (X_{il+1}, X_{il+2}, \ldots, X_{il+l})$,
$i = 0, \ldots, k-1$, be the $i$th block of $l$ observations.
Stationarity ensures that the statistics defined on the $k$
blocks $B_i$ have the same joint distribution. We resample
$k$ blocks from $F_l$, the empirical distribution function of
the length-$l$ blocks, and join them together to form a
bootstrap time series. Calculate the statistic on the
bootstrap time series and repeat $B$ times.

Künsch [1989] extended this method to the overlapping
blocks case. Here there are $n - l + 1$ blocks of length
$l$, $B_i = (X_{i+1}, X_{i+2}, \ldots, X_{i+l})$; the blocks now
overlap, whereas before they did not. We again resample $k$
blocks from this collection and repeat the above process.
In comparison to the nonoverlapping case, this method
reduces the variance of the estimate of variance of the
sample mean by 1/3.
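
The two blocking schemes differ only in where the blocks start. The sketch below, with hypothetical function names and a toy series, forms non-overlapping and overlapping blocks of length l and joins k resampled blocks into one bootstrap series; it assumes k is the integer part of n/l.

```python
import random

def make_blocks(series, length, overlapping=False):
    """Blocks of the given length; the step is 1 if overlapping, else length."""
    n = len(series)
    step = 1 if overlapping else length
    return [series[s:s + length] for s in range(0, n - length + 1, step)]

def block_bootstrap_series(series, length, overlapping=False, rng=None):
    """Resample k = n // length blocks with replacement and concatenate them."""
    rng = rng or random.Random(0)
    blocks = make_blocks(series, length, overlapping)
    k = len(series) // length
    out = []
    for _ in range(k):
        out.extend(rng.choice(blocks))
    return out

series = list(range(1, 21))                       # n = 20 observations
nonoverlap = make_blocks(series, 5)               # k = 4 blocks
overlap = make_blocks(series, 5, overlapping=True)  # n - l + 1 = 16 blocks
boot_series = block_bootstrap_series(series, 5, overlapping=True)
```

A statistic computed on `boot_series`, repeated B times, gives the blockwise bootstrap distribution in the same way as in the iid case.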

Therefore, in order to prove a CLT and create a blockwise
resampling algorithm, it is necessary for a global
sampling plan to have separating blocks for the big-block,
little-block theory and repeating patterns for the
isotropy of the random field.


Sampling  Plans 

There are two basic approaches for creating global sampling
plans: experimental design considerations and geographical
considerations. The experimental design approach does not
necessarily generate designs that have the repeating
patterns necessary in a blockwise resampling algorithm, but
the designs have optimality properties for certain models.
On the other hand, geographical sampling methods are used to
create designs with repeating patterns, but these do not
have the design optimality property. Geographical sampling
plans fall into one of two types: polyhedral tessellations
and map-projections.

Experimental  Designs 

The experimental design approach begins with the following
setup: Consider the specific model with $k$ variables
$x_1, \ldots, x_k$, $p = \frac{1}{2}(k+1)(k+2)$ unknown parameters
$\beta$, and error term $\varepsilon$ with mean 0 and variance $\sigma^2$:

$$y = \beta_0 + \sum_{i=1}^{k} \beta_i x_i + \sum_{i=1}^{k} \beta_{ii} x_i^2 + \sum_{i=1}^{k-1} \sum_{j=i+1}^{k} \beta_{ij} x_i x_j + \varepsilon.$$

Let $(x_{j1}, \ldots, x_{jk})$ be a design point in the region of
operability $O$ and $X$ be the design matrix containing rows
$f(x) = (1, x_1, \ldots, x_k, x_1^2, \ldots, x_k^2, x_1 x_2, \ldots, x_{k-1} x_k)$. The
moment matrix is then $M_X = X'X/n$ and the prediction
variance is $V\{\hat{y}(x)\} = f'(x) M_X^{-1} f(x) \sigma^2 / n$. If we let $R$
be the modeling region and $\mu(\cdot)$ be a uniform measure
over $R$ with total measure 1, then we can choose
design points so as to minimize any one of the criteria
in Table 1.

I-optimal:  $\min \mathrm{trace}\{M M_X^{-1}\}$, where $M = \int_R f(x) f'(x) \, d\mu(x)$
A-optimal:  $\min \mathrm{trace}\{M_X^{-1}\}$
D-optimal:  $\min \det\{M_X\}^{-1/p}$
E-optimal:  $\min \max_i e_i(M_X^{-1})$, where $e_i(\cdot)$ are the eigenvalues
G-optimal:  $\min \max_{x \in R} V\{\hat{y}(x)\}$

Table 1: Optimality Criteria
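
To make the criteria concrete, the following sketch evaluates the moment matrix, the A- and D-criteria, and the prediction variance for a full quadratic model in k = 2 variables on a 3 × 3 factorial design. The design, the helper names, and the pure-Python Gauss-Jordan inverse are assumptions for illustration; this is not how GOSSET or any particular package computes them.

```python
from itertools import combinations

def model_row(x):
    """f(x) for the full quadratic model: 1, x_i, x_i^2, x_i x_j (i < j)."""
    k = len(x)
    return ([1.0] + list(x) + [xi * xi for xi in x]
            + [x[i] * x[j] for i, j in combinations(range(k), 2)])

def moment_matrix(design):
    """M_X = X'X / n for the design's model matrix X."""
    rows = [model_row(x) for x in design]
    n, p = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / n for j in range(p)]
            for i in range(p)]

def inverse_and_det(a):
    """Gauss-Jordan elimination with partial pivoting."""
    p = len(a)
    aug = [list(a[i]) + [1.0 if i == j else 0.0 for j in range(p)]
           for i in range(p)]
    det = 1.0
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(aug[r][c]))
        if piv != c:
            aug[c], aug[piv] = aug[piv], aug[c]
            det = -det
        d = aug[c][c]
        det *= d
        aug[c] = [v / d for v in aug[c]]
        for r in range(p):
            if r != c:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[p:] for row in aug], det

design = [(a, b) for a in (-1.0, 0.0, 1.0) for b in (-1.0, 0.0, 1.0)]
M = moment_matrix(design)
M_inv, det_M = inverse_and_det(M)
p = len(M)
a_criterion = sum(M_inv[i][i] for i in range(p))  # trace{M_X^{-1}}
d_criterion = det_M ** (-1.0 / p)                 # det{M_X}^{-1/p}
f0 = model_row((0.0, 0.0))
# f'(x) M_X^{-1} f(x) at the origin, i.e. the variance in units of sigma^2 / n.
pred_var_origin = sum(f0[i] * M_inv[i][j] * f0[j]
                      for i in range(p) for j in range(p))
```

A design search would evaluate one of these scalars for each candidate set of points and keep the design with the smallest value.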

In 1993, Hardin and Sloane introduced a computer
algorithm called GOSSET that uses a modification of the
pattern search method of Hooke and Jeeves [1961]. The
algorithm uses the gradient of a differentiable function to
find the minimum, and hence it is able to find I-, A-,
or D-optimal designs, but not E- or G-optimal ones. It can
be used with very complicated $O$ and $R$ (balls, cubes,
hyperplanes, and intersections and unions).

We  are  interested  in  balls.  Unfortunately,  when  the 
sample  size  gets  large,  the  sampling  plans  created  by 
GOSSET  do  not  have  repeating  patterns. 

Polyhedral  Tessellations 

The polyhedral tessellation sampling plans usually start
from one of the 5 Platonic solids: tetrahedron, hexahedron
(cube), octahedron, dodecahedron, and icosahedron. The solid
is then inscribed in a sphere and its edges are projected
onto the sphere as great-circle arcs. Most of the
tessellations then apply the alternate method of Gasson
[1983]. His method states that each spherical triangle
can be recursively subdivided into four subtriangles by
placing a vertex at the midpoint of each edge and joining
the new vertices. This method is related to geodesic
domes (Popko [1968]).
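
The recursive four-way subdivision can be sketched as follows for one face of an octahedron inscribed in the unit sphere. The function names are hypothetical, and the area formula (the spherical excess in its arctangent form) is added here only to check the cells; it is not part of the method as described.

```python
import math

def normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

def midpoint(a, b):
    """Midpoint of the great-circle arc from a to b on the unit sphere."""
    return normalize(tuple(x + y for x, y in zip(a, b)))

def subdivide(tri, depth):
    """Split a spherical triangle into four by joining the edge midpoints."""
    if depth == 0:
        return [tri]
    a, b, c = tri
    ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
    cells = []
    for child in ((a, ab, ca), (ab, b, bc), (ca, bc, c), (ab, bc, ca)):
        cells.extend(subdivide(child, depth - 1))
    return cells

def area(tri):
    """Area of a spherical triangle on the unit sphere."""
    a, b, c = tri
    triple = (a[0] * (b[1] * c[2] - b[2] * c[1])
              - a[1] * (b[0] * c[2] - b[2] * c[0])
              + a[2] * (b[0] * c[1] - b[1] * c[0]))
    dots = (sum(x * y for x, y in zip(a, b))
            + sum(x * y for x, y in zip(b, c))
            + sum(x * y for x, y in zip(c, a)))
    return 2.0 * math.atan2(abs(triple), 1.0 + dots)

# One octahedron face: the spherical triangle covering the positive octant.
face = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
cells = subdivide(face, 3)
areas = [area(t) for t in cells]
```

Three recursions give 4^3 = 64 cells whose areas sum to the area of the octant, but the individual areas differ, which is the unequal-area behavior noted for several of the tessellations in this section.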

Dutton [1989] uses the octahedron as a basis for a
quaternary triangular mesh (QTM). Here each face is a
spherical triangle and is recursively subdivided using the
alternate method. Dutton points out that one can get 1
meter resolution in 21 recursions. Goodchild and Shiren
[1989] provided a conversion to the latitude-longitude
scale, since in this setting the "base" edges for the
octahedron and its subdivisions are parallel to the equator.
Unfortunately, this method does not subdivide into cells of
equal area or equal shape.

Wickman, Elvers, and Edvarson [1974] use the dodecahedron
as the basis for their method. Here, each face is a
pentagon and is first subdivided into 5 isosceles triangles.
One then recursively subdivides the isosceles triangles by
the alternate method. The sphere can be subdivided into
equal area pieces, but they have different shapes.

Fekete [1990] uses the icosahedron as the basis for a sphere
quadtree (Samet [1984]). In his approach, each face is
a triangle and the alternate method is applied to each.
This quadtree does not subdivide into equal area pieces,
but there is less distortion of size and shape than with the
QTM method.

White, Kimerling, and Overton [1992] use the truncated
icosahedron as a basis for their method. The truncated
icosahedron has faces that are both pentagons and
hexagons and is the common design for soccer balls.
They begin by decomposing the pentagons into 5 triangles
and the hexagons into 6 triangles. They then apply
the alternate method subdivision on each triangular
face. Their method also does not subdivide into equal
area pieces, but there is less distortion of size and shape
than with the icosahedral method within each face type.


Map-Projections 

The map-projection approaches, on the other hand, use
the latitude-longitude grid as a starting point. Mark
and Lauzon [1985] proposed a system based upon the
Universal Transverse Mercator (UTM) grid, which is used by
most military agencies around the world. They begin
by dividing the 60 UTM zones into north and south
subzones. Each subzone is then subdivided into square
patches within which they define a 256 × 256 array of
cells. This method coexists nicely with present maps;
however, the boundaries between zones introduce slight
unconformities.

Tobler and Chen [1986] proposed a Lambert cylindrical
equal-area projection. This method retains latitude-longitude
ideas to create equal area cells. Unfortunately,
the variation in shape is tremendous, from nearly square
at the equator to long, thin spherical rectangles near the
poles.
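
The trade-off between equal area and shape can be checked numerically. The sketch below assumes the simplest equal-area latitude-longitude grid, with band boundaries equally spaced in the sine of the latitude, which is the spacing a cylindrical equal-area projection induces; the band and column counts and the function names are arbitrary illustrative choices, not Tobler and Chen's actual cell layout.

```python
import math

def equal_area_bands(n_bands):
    """Latitude boundaries (radians) equally spaced in sin(latitude)."""
    return [math.asin(-1.0 + 2.0 * i / n_bands) for i in range(n_bands + 1)]

def cell_stats(n_bands, n_lon):
    """(area, width/height) of one cell in each latitude band, unit sphere."""
    bounds = equal_area_bands(n_bands)
    d_lon = 2.0 * math.pi / n_lon
    stats = []
    for lo, hi in zip(bounds, bounds[1:]):
        cell_area = d_lon * (math.sin(hi) - math.sin(lo))
        height = hi - lo                          # north-south extent
        width = d_lon * math.cos(0.5 * (lo + hi))  # east-west extent at mid-latitude
        stats.append((cell_area, width / height))
    return stats

stats = cell_stats(18, 36)
equator_ratio = stats[9][1]   # band just north of the equator
polar_ratio = stats[-1][1]    # band touching the north pole
```

All cells have the same area, but the width-to-height ratio falls from above 1 at the equator to well below 0.2 at the poles, which is the shape variation described above.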


Brown [1993a] introduced the stratified spherical sampling
plan (SSSP), which uses a latitude-longitude structure
and creates nearly equal area rectangles throughout
the sphere. This method does not have the distortion of
the Tobler and Chen method. Here, the sphere is cut
into "wafers" parallel to the equator (such
as the area between the 70 and 80 degree latitudinal
lines on a globe). Upon each wafer a specific
latitude-longitude grid is constructed to create almost equal
area pieces where distance (horizontal and vertical) is
asymptotically preserved within and across wafers.

Each SSSP is made up of 5 parts: the northern cap
$C_N(r)$, the southern cap $C_S(r)$, the northern hemisphere
$H_N(r)$, the southern hemisphere $H_S(r)$, and the equatorial
region $E(r)$. The northern and southern caps and
the equatorial region are used as little blocks and
separate the two hemispheres that drive the distribution
theory.

They can be explicitly calculated by using functions
$\gamma_1^*(r)$, $\phi(r)$, $\theta_w(r)$, and integer sequences $J_r$ and
$V_r$, where $\theta_w(r)$ and $\phi(r)$ are the horizontal and
vertical generating angles of the latitude-longitude grid on
wafer $w$; $J_r$ and $V_r$ are the number of $\phi(r)$ vertical
angular increments in each wafer and in the equatorial
region, respectively; and $\gamma_1^*(r)$ is used to calculate the
top of the first wafer. From these quantities, we can
calculate $W_r$, the number of wafers that the sphere is
partitioned into; $n_{w,r}$, the number of $\theta_w(r)$ angles that
go around wafer $w$; and $\gamma_w(r)$, the vertical angle to the
top of wafer $w$.

Denote a point $P$ on a sphere of radius $r$ by its spherical
coordinates $P = (r, \theta, \phi)$, where $\theta$ is the angle between
the positive x-axis and the ray from the origin to $P^*$, the
projection of $P$ onto the xy-plane, and $\phi$ is the angle
between the positive z-axis and the ray from the origin to
$P$.
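
As a small check of this coordinate convention, the following sketch converts (r, θ, φ) to Cartesian coordinates, with φ measured from the positive z-axis. The function name is hypothetical.

```python
import math

def to_cartesian(r, theta, phi):
    """theta: angle from the positive x-axis within the xy-plane;
    phi: angle from the positive z-axis."""
    return (r * math.sin(phi) * math.cos(theta),
            r * math.sin(phi) * math.sin(theta),
            r * math.cos(phi))

north_pole = to_cartesian(1.0, 0.0, 0.0)
south_pole = to_cartesian(1.0, 0.0, math.pi)
equator_point = to_cartesian(1.0, math.pi / 2, math.pi / 2)
```

With this convention, (r, 0, 0) is the north pole and (r, 0, π) the south pole, and the equator corresponds to φ = π/2.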

Given functions $\gamma_1^*(r)$, $\phi(r)$, $\theta_w(r)$, and integer
sequences $J_r$ and $V_r$, calculate $W_r$, $n_{w,r}$, and $\gamma_w(r)$,
mathematically, by first calculating

$$U_r = \frac{1}{\phi(r) J_r} \{\pi - 2\gamma_1^*(r)\} - \frac{V_r}{J_r}$$

and then put $W_r = U_r - 2z^*$, where $z^* \in [0, 1)$ is chosen
so that $W_r$ is an even integer. Then define the vertical
angle to the top of the first wafer as $\gamma_1(r) = \gamma_1^*(r) + z^* J_r \phi(r)$.
For $1 \le w \le W_r/2$, define the vertical angle to
the top of wafer $w$ and the number of $\theta_w(r)$ angles that go
around wafer $w$ as $\gamma_w(r) = \gamma_1(r) + (w-1) J_r \phi(r)$ and
$n_{w,r} = n_{W_r+1-w,r} = \lfloor 2\pi/\theta_w(r) \rfloor$, and for the equatorial region,
$\gamma_E(r) = \gamma_1(r) + W_r J_r \phi(r)/2$ and $n_{E,r} = \lfloor 2\pi/\theta_E(r) \rfloor$.

For the hemispherical and equatorial regions, we
sample at the vertices of the wafer-specific
latitude-longitude grid. Since the shape of each cap is
topologically different from that of the wafers, we use a
modified hexagonal sampling plan (Matern [1986]), which
provides circular symmetry within the cap.

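Define

$$H_N(r) = \bigcup_{w=1}^{W_r/2} \bigcup_{j=0}^{J_r-1} \bigcup_{i=0}^{n_{w,r}-1} P^N_{wij}(r), \qquad H_S(r) = \bigcup_{w=1}^{W_r/2} \bigcup_{j=0}^{J_r-1} \bigcup_{i=0}^{n_{w,r}-1} P^S_{wij}(r),$$

where in this range for $w$, $P^N_{wij}(r) = (r, i\theta_w(r), \gamma_w(r) + j\phi(r))$
and $P^S_{wij}(r) = (r, i\theta_w(r), \pi - (\gamma_w(r) + j\phi(r)))$.

Define

$$E(r) = \bigcup_{j=0}^{V_r} \bigcup_{i=0}^{n_{E,r}-1} P^E_{ij}(r),$$

where $P^E_{ij}(r) = (r, i\theta_E(r), \gamma_E(r) + j\phi(r))$. Define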


C'5(r) 


(r,0,0)U 

(r,0,7r)U 


U  and 

i=i  <=i  / 

U  U 

i=i  .=1  / 


where $P^{C_N}_{ij}(r) = (r, i\pi/(3j), j\phi(r))$ and $P^{C_S}_{ij}(r) = (r, i\pi/(3j), \pi - j\phi(r))$.
An SSSP can now be given by
$B = C_N(r) \cup H_N(r) \cup E(r) \cup H_S(r) \cup C_S(r)$.

The SSSPs are the only finite global sampling plans
upon which a CLT has been proved (Brown [1993b]).
In addition, this is the only global sampling plan upon
which a resampling (overlapping bootstrap) algorithm
has been designed and strong uniform consistency has
been proved for the sample mean (Brown [1993c]).

Abstract

The covering method algorithm can be used to calculate
power-law feature vectors based on the local texture in an
image. These features can then be used for distinguishing
between different types of textures. We present a new
method of calculating local fractal-based features in the
presence of a continuous-valued, irregular and/or incomplete
segmentation by use of a Dijkstra potential map. This
method produces more accurate power-law features for pixels
near a segmentation boundary by altering the size and
shape of the local neighborhood in which the calculations
take place, thereby producing a more texturally pure
neighborhood. This leads to improved texture discrimination since
the contribution of multiple textures to the calculation of a
given feature vector is reduced or eliminated.

1.  Introduction 

To oversimplify, those who have studied the utility of
using fractal dimension for discriminant analysis in, say,
biological images can be grouped into one of two categories.
There are those who feel the information inherent in the
fractal dimension of a texture should be useful for distinguishing
certain classes of tissue even though few conclusive studies
have yet been presented, and those for whom the results
obtained thus far are unconvincing enough to warrant a
decision to move on to other approaches. The optimists feel a
system utilizing fractal dimension in conjunction with other
information and techniques will be superior to a system
which fails to utilize any type of textural information. This
paper presents one reason why the results obtained thus far
are less impressive than some have expected, introduces a
new methodology for extracting fractal dimension features
which circumvents this cause, and indicates, finally, that this
modified approach to fractal dimension does indeed live up
to the potential that the optimists have long held out for it.

Section 2 presents a description of a modification of the
covering method algorithm for estimating fractal dimension
which incorporates segmentation boundaries. A qualitative
comparison of the procedure with the standard covering
method is presented in Section 3. Probability density
estimates for the extracted feature vectors are developed and
compared. Examples are presented for a standard texture
benchmark and for tumor detection in X-ray mammography.
It is shown that there is significantly more discriminatory
information in the texture features when they are extracted
via the new method.

2.  Approach 

Richardson's power law (Mandelbrot, 1977) provides a
functional relation between a measured property of a fractal
and a measurement scale. The function is given by

$M(\varepsilon) = K \varepsilon^{d - D}$,  (1)

where $M(\varepsilon)$ is a measured property of a fractal at scale $\varepsilon$, $K$ is
a constant of proportionality, and $d$ and $D$ are the topological and
fractal dimensions, respectively. Taking the logarithm of Eq.
(1) provides the slope and y-intercept of a best-fit line
through $\log(M(\varepsilon_i))$ for a set of scales $\{\varepsilon_i\}$ as a set of power
law features.
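
The extraction of the two power law features can be sketched as an ordinary least-squares fit in log-log coordinates. The sketch assumes Richardson's law in the form M(ε) = K ε^(d − D), so that the slope is d − D and the intercept is log K; the function name and the synthetic measurements are illustrative only.

```python
import math

def power_law_features(scales, measurements):
    """Slope and y-intercept of the best-fit line through
    (log scale, log measurement)."""
    xs = [math.log(e) for e in scales]
    ys = [math.log(m) for m in measurements]
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    return slope, intercept

# Synthetic surface with K = 3, topological dimension 2, fractal dimension 2.4.
K, d, D = 3.0, 2.0, 2.4
scales = [1, 2, 3, 4, 5]
measurements = [K * e ** (d - D) for e in scales]
slope, intercept = power_law_features(scales, measurements)
D_estimate = d - slope
```

On exact power-law data the fit recovers the fractal dimension from the slope and log K from the intercept; on image data the two values together form the feature vector at a pixel.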

The property $M(\varepsilon)$ we wish to measure is the surface
area of the image about a pixel, which can be estimated using
the covering method (Peleg et al., 1984). The covering
method typically consists of three steps: recursive
application of dilation and erosion operators to calculate upper, $U$,
and lower, $L$, bounding surfaces for scales $\varepsilon_1, \ldots, \varepsilon_{\max}$;
calculation of an averaged surface area, $A$, at each scale from $U$
and $L$; and calculation of power law features from $A$. When
two or more textures are present in an image, the morphological
operators and the averaging process will both lead to
erroneous estimates of $A$, and thus of the derived features, if
boundaries between textures are not accounted for. To remedy
the errors due to the dilation and erosion operators we
utilize the modified dilation and erosion operators (Rogers
et al., 1993, and Julin et al., 1994)

=  win.  +  p],  ■  ,(2b)


where $U^{\varepsilon}$ is the upper surface and $L^{\varepsilon}$ the lower surface at scale $\varepsilon$,
and $i, j$ are the row and column indices, respectively. $S$ is a
continuously valued segmentation map with $S \in [0,1]$, where
$S = 0$ for the strongest possible segmentation boundary and
$S = 1$ for no boundary.

The upper and lower surfaces at scale zero are given by

$U^0_{i,j} = L^0_{i,j} = G_{i,j}$,  (3)

where $G_{i,j}$ is the original gray scale image.

It  is  customary  to  utilize  the  average  area  formula  of 
Peli  (Peli,  1990), 


to reduce the variation of the area from pixel to pixel. In this
method the averaging window is $W = W(\varepsilon)$, such that at scale
$m$ the window about pixel $i,j$ should be larger than
$(2m + 1) \times (2m + 1)$ so that the window contains sufficient
uncorrelated values. However, when the window encompasses
multiple textures the averaging process is a source of
error.

To reduce or eliminate the effects of averaging multiple
textures we introduce a boundary observing adaptive kernel
based on Dijkstra potentials (Dijkstra, 1959). In this
approach a potential is calculated about every pixel in the
image from costs defined below. The potential is then
utilized in constructing a kernel for computing the average area
about each pixel.

In the current calculations two types of costs are
considered. The first is the cost based on the shortest possible path
from the current pixel to the window's central pixel. The
distance used for the current calculations is based not on the
(physical) distance between pixel centers, but rather on the
number of steps required to move from the current pixel to
the central pixel. This cost is dependent upon the type of
connections we allow between pixels. For example, the cost
of connecting a diagonal neighbor of pixel $k,l$ to $k,l$ would
be 2 if we constrain connections to the north, east, west, and
south four nearest neighbors (first we must move to a
horizontally or vertically adjacent pixel, then to $k,l$).

The second type of cost is that of being coincident with or
adjacent to a boundary pixel. For a binary boundary ($S = 0$ or
1 only) this cost is set to an arbitrarily large value. If a pixel
is not adjacent to a boundary pixel this cost is zero. For
continuously valued segmentation boundaries we utilize the cost
function

$C_{i,j} = \alpha (1 - \min(S_{i,j}, S_{i',j'}))$,  (6)

where the prime denotes pixels within the neighborhood and
$\alpha$ is a parameter describing the amount of information
allowed to cross the boundary. Other types of costs or cost
functions are easily implemented.

Once  the  costs  have  been  computed  the  four  nearest- 
neighbor  recursive  potential  update  equation, 

1 

=  mm^  V^_  +  ^  (7) 

\/  kyl  e  W^j 

is iterated to convergence. Here $V^a_{k,l}$ is the potential at step $a$
and $C_{k,l}$ the sum of costs at pixel $k,l$, with

$V^0_{k,l} = 0$ if $k,l = i,j$; $\infty$ otherwise.  (8)

In the present study we have utilized a window of fixed
"radius," $r$, i.e., the window is of size $(2r+1) \times (2r+1)$, as
opposed to the variable window of Peli. We feel that this is
appropriate as long as $r > \varepsilon_{\max}$. We note that it will be
possible for the kernel to be smaller than the window in the
vicinity of a boundary.
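
A minimal sketch of a potential map over one window is given below. It makes two assumptions that go beyond the text: moving to a four-connected neighbor costs one step plus the boundary cost of the pixel entered, and the iteration to convergence is replaced by the standard priority-queue form of Dijkstra's algorithm, which reaches the same fixed point when all costs are non-negative. The function name and the toy cost map are hypothetical.

```python
import heapq

def potential_map(cost, center):
    """Cheapest four-connected path cost from the central pixel to every pixel.
    Assumption: entering pixel (k, l) costs 1 step plus cost[k][l]."""
    rows, cols = len(cost), len(cost[0])
    inf = float("inf")
    V = [[inf] * cols for _ in range(rows)]
    ci, cj = center
    V[ci][cj] = 0.0                      # zero at the center, infinite elsewhere
    heap = [(0.0, ci, cj)]
    while heap:
        v, k, l = heapq.heappop(heap)
        if v > V[k][l]:
            continue                     # stale queue entry
        for dk, dl in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            kk, ll = k + dk, l + dl
            if 0 <= kk < rows and 0 <= ll < cols:
                cand = v + 1.0 + cost[kk][ll]
                if cand < V[kk][ll]:
                    V[kk][ll] = cand
                    heapq.heappush(heap, (cand, kk, ll))
    return V

# 5 x 5 window with a vertical boundary of large cost in column 3.
cost = [[100.0 if l == 3 else 0.0 for l in range(5)] for _ in range(5)]
V = potential_map(cost, (2, 2))
```

Pixels on the near side of the boundary keep small potentials equal to their step counts (a diagonal neighbor has potential 2), while every pixel on or beyond the boundary column has a potential above 100, so a weight function that decreases with the potential effectively excludes them from the average.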

We may now utilize the Dijkstra potentials in the
calculation of the area about pixel $i,j$ by performing a weighted
summation over the window using

$\bar{A}^{\varepsilon}_{i,j} = \dfrac{\sum_{k,l \in W_{i,j}} w_{k,l} A^{\varepsilon}_{k,l}}{\sum_{k,l \in W_{i,j}} w_{k,l}}$,  (9)

where $A^{\varepsilon}_{k,l}$ is the area calculated by Eqn. (5) at pixel $k,l$ and
$w$ is a weight function based on the Dijkstra potential given
by, say,


Otherwise, 


(10) 


for a square kernel where $\lambda$ is a parameter. In Section 3
below we use $\varepsilon_{\max} = 5$, $r = 8$, and $\alpha = \lambda = 16$.

3.  Results 

In this section we present the results of using the above
technique on two illustrative examples. The first consists of
considering the estimate of the y-intercept value from two
Brodatz texture patches (Brodatz, 1966). The ability to
obtain a good estimate in the region of transition between the
two textures yields superior performance in a change point
detection scenario. The second example presented considers
an X-ray mammogram and investigates the ability to
distinguish a tumorous region from the healthy tissue. Here we
consider the estimate of the fractal dimension itself. In both
examples the incorporation of boundary information into the
calculation of our features is vital to obtaining an acceptable
level of performance. Probability density functions are
developed using the method of adaptive mixtures (Priebe and
Marchette, 1993) and utilize the imposed measure
methodology (Priebe et al., 1994).

3.1  Example  1 

Given two textures from Brodatz (Figure 1.1) we consider
three regions. The leftmost box (box 1) superimposed on the
textures in Figure 1.1 is well within the interior of the left
texture and can reasonably be considered a region of pure texture
1 (D17 of Brodatz). Similarly, the rightmost box (box 3) is a
region of pure texture 2 (D24 of Brodatz). The middle box
(box 2) straddles the boundary between the two textures. This
border region contains some pixels from texture 1 and some
from texture 2, as well as the boundary.

Figure 1.2 shows (as solid lines) the pdfs obtained from
the pure textures in boxes 1 and 3, calculated separately.
These pdfs for the different textures are well separated when
the regions considered are far from the border and hence
uniform in texture. The dashed line in Figure 1.2 shows, however,
that when we consider a border region (box 2) the errors
arising from calculating power law features over a region
containing two distinct textures make it impossible to determine
the structure of the region. This pdf does not convey the
fact that the region considered contains exactly two distinct
textures. The dotted curve in Figure 1.2 indicates the pdf of
the border region (box 2) when a priori boundary information
($\beta$ = 0 or 1) is incorporated into the calculation of the power
law features. It is obvious from this pdf that the region being
considered is simply made up of two subregions with characteristics
corresponding to those in boxes 1 and 3. This superior
information is easily translated into superior performance
in discriminant analysis or change point detection scenarios.


Figure  1.1. 

Two  adjacent  texture  patches  and  the  three  regions 
(numbered  1  through  3  from  the  left)  used  in  Example  1. 


Figure 1.2.

Pdfs  of  the  y-intercept  feature  for  the  three  regions  from 
Figure  1.1. 


3.2 Example 2

For  example  2  we  consider  the  mammogram  shown  in 
Figure  2.1.  We  will  focus  on  the  boxed  region  in  the  upper 
right.  This  region  contains  a  tumorous  region  (biopsy  veri- 
with the radiologist’s boundary drawn in. We consider two
disjoint  regions.  The  tumorous  region  (region  1)  is  the  region 
within  the  radiologist’s  boundary.  The  healthy  region  (region 
2)  is  the  area  simultaneously  within  the  box  and  outside  the 
tumorous  region. 


Figure 2.1

Mammogram  used  in  Example  2  with  radiologist’s 
boundary  of  tumorous  region  overlaid. 


Figures  2.2  and  2.3  show,  respectively,  pdfs  for  the  two 
regions  when  the  true  boundary  has  been  incorporated  into 
the  calculation  of  the  features  (2.2)  and  when  no  boundary  is 
used  (2.3).  We  clearly  see  that  the  presence  of  the  boundary 
in  the  feature  extraction  is  vital  to  the  utility  of  the  features  for 
distinguishing  tumorous  tissue  from  healthy  tissue. 

Unfortunately, obtaining a true boundary like that shown
in Figure 2.1 and used in Figure 2.2 is costly and time
consuming. Furthermore, the ultimate utility of this procedure
for a real application depends on the ability to automatically
generate a boundary that will be useful in this context. Figure
2.4 shows the radiologist’s boundary superimposed on a
particular wavelet segmentation map. This wavelet map is by no
means perfect. The boundary is not closed, it is not necessarily
exactly coincident with the radiologist’s boundary, it is
continuously valued rather than binary, and there is noise.
Nevertheless, it generally marks the edge of the tumorous
region. When this boundary is used in the feature extraction
the resultant pdfs are depicted in Figure 2.5. We see that the
separation of the two classes is maintained to a degree similar
to that obtained when the radiologist’s boundary was
employed. Discriminant analysis could be successfully pursued
here, as in Figure 2.2, while Figure 2.3 (the no boundary
case) leaves little hope.


Figure  2.2 


Pdfs  for  fractal  dimension  from  Example  2,  calculated  using 
the  radiologist’s  boundary.  Solid  curve  is  tumorous  tissue, 
dashed  curve  is  healthy  tissue. 
Figure 2.3

Pdfs  for  fractal  dimension  from  Example  2,  calculated  with 
no  boundary  information.  Solid  curve  is  tumorous  tissue, 
dashed  curve  is  healthy  tissue. 
Figure 2.4

Incomplete, grayscale wavelet segmentation map with
radiologist’s boundary overlaid. This continuous valued
map is used for Figure 2.5.




Figure  2.5 

Pdfs for fractal dimension from Example 2, calculated
using the continuous valued wavelet boundary
from Figure 2.4. Solid curve is tumorous tissue, dashed
curve is healthy tissue.


4.  Discussion 

The examples presented in Section 3 indicate that the
utility of fractal dimension features for texture discrimination
hinges on calculating the features in regions of uniform
texture. For applications in which one necessarily must consider
border regions between different textures the standard
calculations do not provide the necessary capabilities. Incorporating
a segmentation boundary into the calculation of the
texture features, whether it be a true boundary known a priori
or a boundary map estimated through a wavelet or other
algorithm, greatly improves the discrimination capabilities one
can expect.

It is argued that this modification must be considered in
any evaluation of the utility of power law features for
discriminant analysis, change point detection, or homogeneity
analysis whenever texture boundaries come into play.
Abstract^

Feedforward neural networks are widely used as a black
box prediction technique. Recent work of Barron (1991)
shows that these models are very well suited to approximating
structure in high dimensions. This raises the
issue of how well they find spurious structure in noise.

This paper presents a diagnostic based on Aldous’s
Poisson clumping heuristic that describes the extent to
which nets can overfit, where in the data such spurious
overfitted units are likely to arise and how many local
optima the sum of squared error surface (as a function
of the network weights) is expected to have.

The diagnostic is simplest for the case of a single hidden
unit, but extends in principle to more general problems.

1  Introduction 

We consider a nonlinear regression model of the form
$Y = \mu(X) + \epsilon$. Here the response variable Y is the sum
of a signal $\mu(X)$ and a noise random variable $\epsilon$ with
mean 0 and variance $\sigma^2$. The signal is a function of X,
a vector of predictor variables. The form of the signal is

$$\mu(X) = \omega_0 + \sum_{j=1}^{J} \omega_j\, \phi_j(X, \theta_j)$$  (1)

where the $\omega_j$ are scalar parameters (“weights”), the $\phi_j$
are real valued “activation” functions and the $\theta_j$ are vector
valued parameters.

The model (1) is an example of an artificial neural
network model. This special case is known as a feedforward
network with a single hidden layer and a linear
output unit. See Hertz, Krogh and Palmer (1991,
Chapters 5,6) for an introduction to these models. Commonly
used activations are sigmoids such as $\phi(X,\theta) =$

^This  work  supported  by  NSF  grants  DMS-9011074  and  94- 
$(1 + \exp(-X'\theta))^{-1}$ and Gaussian radial basis functions
such as $\phi(X,\theta) = \exp(-\|X - \theta\|^2/2\tau^2)$. In the sigmoid
above, X usually includes a component that is always 1
and the corresponding intercept component of $\theta$ is known
as a “bias”. In the radial basis function the parameter $\tau$
is a measure of scale that could either be subsumed into
$\theta$ or held fixed.
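
As a concrete reading of the model and the two activations just described, the following Python sketch evaluates a signal built from a handful of hidden units. The function names and argument layout are illustrative assumptions, and any constant term in the signal is left out for brevity.

```python
import math

def sigmoid_unit(x, theta):
    """Logistic activation (1 + exp(-x'theta))^-1; x should carry a constant 1 for the bias."""
    s = sum(xi * ti for xi, ti in zip(x, theta))
    return 1.0 / (1.0 + math.exp(-s))

def gaussian_rbf_unit(x, theta, tau=0.5):
    """Gaussian radial basis activation exp(-||x - theta||^2 / (2 tau^2))."""
    d2 = sum((xi - ti) ** 2 for xi, ti in zip(x, theta))
    return math.exp(-d2 / (2.0 * tau ** 2))

def signal(x, weights, thetas, unit):
    """Weighted sum of hidden units: sum_j w_j * unit(x, theta_j)."""
    return sum(w * unit(x, th) for w, th in zip(weights, thetas))
```

Setting one weight to zero makes the corresponding unit redundant: the value returned by `signal` no longer depends on that unit's `theta`, which is the situation studied in the next section.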

2 Asymptotics and Redundant Units

Estimation of model (1) is usually based on training data
consisting of n independent observations $(X_i, Y_i)$. Let
$\hat\omega_j$ and $\hat\theta_j$ denote estimates of the parameters and $\hat\mu(X)$
denote the resulting estimate of signal.

If model (1) holds, then mild assumptions on the distribution
of $\epsilon$ and identifiability assumptions on $\omega_j$, $\phi_j$, $\theta_j$
produce the usual asymptotics as $n \to \infty$ for $\hat\mu$ estimated
by minimizing squared error $\sum_{i=1}^{n} (Y_i - \mu(X_i))^2$. The
parameters are estimated consistently (up to some permutations
of labels which don’t matter) and are asymptotically
normally distributed. The mean squared error on
the training data is smaller than $\sigma^2$, but this optimism
is simply accounted for by adjusting for the degrees of
freedom used in fitting the model. For details see White
(1989).

These asymptotics are suspect for the problem at
hand. Partly this is because typical applications use a
very large number of parameters. When a large number
of parameters are in use, the identifiability assumption
becomes questionable. The model (1) is not identifiable
if, for example, $\omega_j = 0$, for then the corresponding $\theta_j$ has
no effect on Y. There is thus no “true value” for $\theta_j$ and
estimates of it have nothing to converge to. This undermines
the usual approach to asymptotic theory. The
unit $\phi_j(X,\theta_j)$ is said to be redundant, and the corresponding
estimate $\phi_j(X,\hat\theta_j)$ is said to be spurious.

It is unlikely in practice that an exactly redundant
unit will be encountered. But in a model with many
units it is reasonable that some of the $\omega_j$ will be close to
zero and hence that some units are nearly redundant.

Redundant unit asymptotics are like those of broken
line regression. Here X is a scalar and

$$\mu(X) = \omega_0 + \omega_1 X + \omega_2 (X - \theta)_+$$  (2)

where $z_+$ denotes $\max(z, 0)$, and the only nonlinear parameter
is a scalar $\theta$. When $\omega_2 = 0$, the signal is linear
in X, and minimizing $\sum_{i=1}^{n} (Y_i - \mu(X_i))^2$ over all broken
line regressions reduces squared error by more than
it would in a four parameter linear model. The nonlinear
parameter $\theta$ “uses up” approximately 2 degrees of
freedom, according to simulations in Hinkley (1969) and
asymptotics in Owen (1991). The maximizing value $\hat\theta$
can appear anywhere but it is more likely that spurious
bends will appear near the ends of the observed range of
the $X_i$s.

The  questions  for  neural  networks  are: 

Q1 How many degrees of freedom do the nonlinear parameters
in (1) use up?

Q2 Where are the spurious units most likely to appear?

Q3 Which units if any are less prone to overfitting?

3  One  Nonlinear  Unit 

To examine these issues we consider the simplified problem
of training a single hidden unit. The model

$$\mu(X) = A(X)\beta + \omega\,\phi(X,\theta)$$  (3)

has one hidden unit to train, and when $\omega = 0$ that one
unit is redundant. The term $A(X)\beta$ is a linear model in
some non-adaptive basis functions A(X) with coefficients
$\beta$. This might be simply a constant, or a linear model
in X, or it might include units $\omega_j\phi(X,\theta_j)$ with their
nonlinear parameters $\theta_j$ frozen at some values and with
$\omega_j$ subsumed into $\beta$. The model without the redundant
unit is:

$$\mu(X) = A(X)\beta$$  (4)

Even when (4) is true, the sum of squared errors under
(3) will be smaller. For any fixed $\theta$ the reduction is

$$S(\theta) = SSE_{(4)} - SSE_{(3)}(\theta) \sim \sigma^2\chi^2_{(1)}$$  (5)

The result in (5) is exact for normally distributed
errors and is an asymptotic approximation otherwise.
The reduction of the squared error of model (3) over (4)
is

$$S = \sup_{\theta \in \Theta} S(\theta)$$  (6)

and S does not have a $\sigma^2\chi^2_{1+d}$ distribution, with $d = \dim(\Theta)$,
as one might have expected based on linear
model theory.
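
The reduction in squared error is easy to compute directly in the simplest case, where the linear part A(X) is just a constant. The sketch below assumes exactly that case: adding one regressor to an intercept-only fit lowers the sum of squared errors by the squared inner product of the centered regressor with the centered response, divided by the squared length of the centered regressor. The function names and the use of a finite grid in place of the supremum are illustrative assumptions.

```python
def sse_reduction(phi, y):
    """S(theta) when A(X) = 1: drop in SSE from adding the regressor phi to an intercept-only fit."""
    n = len(y)
    phi_bar = sum(phi) / n
    y_bar = sum(y) / n
    r_phi = [p - phi_bar for p in phi]  # residual of phi after fitting the constant
    r_y = [v - y_bar for v in y]        # residual of Y after fitting the constant
    spp = sum(p * p for p in r_phi)
    if spp == 0.0:                      # a constant unit explains nothing beyond the intercept
        return 0.0
    spy = sum(p * v for p, v in zip(r_phi, r_y))
    return spy * spy / spp

def sup_reduction(x, y, unit, theta_grid):
    """Approximate S = sup over theta of S(theta) by searching a finite grid of theta values."""
    best_theta, best_s = None, -1.0
    for theta in theta_grid:
        s = sse_reduction([unit(xi, theta) for xi in x], y)
        if s > best_s:
            best_theta, best_s = theta, s
    return best_theta, best_s
```

Applying `sup_reduction` to pure noise responses with a fine grid gives a feel for how much larger the maximized reduction is than the reduction at any single fixed value of the nonlinear parameter.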

It is convenient to define a signed root process via

$$Z(\theta) = \pm S(\theta)^{1/2} \sim N(0, \sigma^2)$$  (7)

where the sign of $Z(\theta)$ is the same as that of $\omega$ when
fitting (3) with $\theta$ fixed. Let $Z_{\max} = \sup_{\theta \in \Theta} Z(\theta)$. For
large $y > 0$, $P(S > y) = 2P(Z_{\max} > \sqrt{y})$.

Suprema of Gaussian random fields, such as $Z(\theta)$,
have been well studied. At any $\theta$, for smooth processes,
Z and its first two derivatives have a joint Gaussian distribution.
A local maximum of Z above $z_0$ is a point $\theta$
such that $Z(\theta) > z_0$, the gradient of Z vanishes at $\theta$ and
the Hessian of Z is negative definite at $\theta$. A standard
tail approximation is

$$P(Z_{\max} > z_0\sigma) = E(\#\text{Local Maxima} > z_0\sigma) = \int_{\Theta} \lambda(\theta)\, d\theta$$

where $\lambda(\theta)$ is the intensity of high local maxima of Z
near $\theta$.

For one dimensional intervals $\Theta$ of finite length, this
formula is the expected number of “upcrossings” of the
level $z_0\sigma$ by the process $Z(\theta)$. If one adds the probability
that Z exceeds $z_0\sigma$ at one end of $\Theta$ one gets
Rice’s formula which is in this case an upper bound on
$P(Z_{\max} > z_0\sigma)$. For stationary fields, this formula reduces
to the volume of $\Theta$ times an intensity that is constant
in $\theta$. See Adler (1981, Chapter 6). The formula
above is taken from Aldous (1989, Chapter J7). This formula
is the lead term in the more accurate but more difficult
formulas obtained by Siegmund and Knowles (1989).
The more accurate formulas take more care around the
boundary of $\Theta$.

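The intensity function is

$$\lambda(\theta) = (2\pi)^{-(d+1)/2}\, z_0^{\,d-1}\, e^{-z_0^2/2}\, |\Lambda(\theta)|^{1/2}$$

where $|\Lambda|$ denotes the determinant of $\Lambda$ and $\Lambda(\theta_0)$
is the Hessian of the correlation matrix of $Z(\theta)$ evaluated
at $\theta_0$.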
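
As a small numerical companion to the intensity function, the sketch below evaluates it from a supplied determinant and approximates its integral over the parameter set by a Riemann sum. The formula coded here is the reading $(2\pi)^{-(d+1)/2} z_0^{d-1} e^{-z_0^2/2} |\Lambda|^{1/2}$; the function names and the equal-volume grid cells are assumptions made for illustration.

```python
import math

def intensity(det_hessian, z0, d):
    """Intensity of high local maxima, given |Lambda(theta)|, the level z0 and d = dim(Theta)."""
    return ((2.0 * math.pi) ** (-(d + 1) / 2.0)
            * z0 ** (d - 1)
            * math.exp(-z0 ** 2 / 2.0)
            * math.sqrt(det_hessian))

def expected_high_maxima(det_values, cell_volume, z0, d):
    """Riemann-sum approximation to the integral of the intensity over Theta.

    det_values  : |Lambda(theta)| evaluated at the centers of equal-volume grid cells
    cell_volume : volume of one grid cell
    """
    return sum(intensity(v, z0, d) for v in det_values) * cell_volume
```

For d = 1 the expression reduces to the familiar upcrossing rate, the square root of the determinant times $e^{-z_0^2/2}/(2\pi)$.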

Owen (1993a, Theorem 2) gives an expression for the
rs entry of $\Lambda(\theta)$. Let $\Phi$ be the vector of n values $\phi(X_i,\theta)$,
let $\Phi_r$ be the vector of $\partial\phi(X_i,\theta)/\partial\theta_r$ and let M be the
projection matrix on the space spanned by the matrix
with n rows given by $A(X_i)$. Define the inner product


A.B.  Owen  59 


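$\langle g,h \rangle = g'(I - M)h$ and define $\gamma_r = \langle \Phi_r, \Phi \rangle / \langle \Phi, \Phi \rangle$. Then

$$\Lambda_{rs} = \langle \Phi_r - \gamma_r\Phi,\; \Phi_s - \gamma_s\Phi \rangle / \langle \Phi, \Phi \rangle.$$  (8)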

This equation may be better understood as an algorithm:
Construct the vectors $\Phi$ and $\Phi_r$ by evaluating
$\phi(X_i,\theta)$ and its gradient with respect to $\theta$. Then replace
them by their residuals after fitting a linear model on the
predictors A(X). Then find the partial correlation of the
resulting $\Phi_r$ and $\Phi_s$ variables after adjusting for $\Phi$.
The result provides $\Lambda_{rs}(\theta)$. Doing this for all r and s
and taking the determinant allows one to calculate the
intensity $\lambda(\theta)$.
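
The algorithm just described can be sketched for one concrete case: a Gaussian radial basis unit with a two-dimensional center and A(X) = 1, so that "taking residuals" is simply centering. The sketch rests on assumptions that should be checked against Owen (1993a): the adjusted cross products are divided by the squared length of the centered unit vector, and the names `hessian_matrix_rbf` and `root_det` are hypothetical.

```python
import math

def hessian_matrix_rbf(X, theta, tau=0.5):
    """2x2 matrix Lambda(theta) for a Gaussian radial basis unit with A(X) = 1 (assumed form)."""
    n = len(X)
    phi = [math.exp(-((x[0] - theta[0]) ** 2 + (x[1] - theta[1]) ** 2) / (2 * tau ** 2))
           for x in X]
    # d phi / d theta_r = phi * (x_r - theta_r) / tau^2
    grads = [[p * (x[r] - theta[r]) / tau ** 2 for p, x in zip(phi, X)] for r in range(2)]

    def center(v):
        # residual after fitting the constant, i.e. (I - M) v when A(X) = 1
        m = sum(v) / n
        return [vi - m for vi in v]

    def dot(a, b):
        return sum(ai * bi for ai, bi in zip(a, b))

    phi_c = center(phi)
    pp = dot(phi_c, phi_c)
    adjusted = []
    for g in (center(g) for g in grads):
        gamma = dot(g, phi_c) / pp  # regression coefficient of the gradient on the unit
        adjusted.append([gi - gamma * pi for gi, pi in zip(g, phi_c)])
    return [[dot(adjusted[r], adjusted[s]) / pp for s in range(2)] for r in range(2)]

def root_det(L):
    """|Lambda|^(1/2) for a 2x2 matrix, clipping tiny negative round-off to zero."""
    det = L[0][0] * L[1][1] - L[0][1] * L[1][0]
    return math.sqrt(max(det, 0.0))
```

Evaluating `root_det(hessian_matrix_rbf(X, theta))` over a grid of centers gives the kind of surface that would be contoured to see where spurious units are most likely.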

Thus for one nonlinear unit, we have a way to approximately
answer the questions raised above:

A1 Integrate $\lambda$ over $\Theta$ and compare with chisquare tail
probabilities.

A2 Maximize or plot $\lambda$ (or $|\Lambda|^{1/2}$) over $\Theta$.

A3 Compare $\lambda$ (or $|\Lambda|^{1/2}$) for different activations
$\phi(X,\theta)$.

In A2 and A3 the use of $|\Lambda|^{1/2}$ is a little simpler since
unlike $\lambda(\theta)$, it does not depend on $z_0$.

4  Results  and  Examples 

The intensity function $\lambda(\theta)$ can be evaluated either
numerically or theoretically. Based on this, one can find
predictions of the Poisson clumping heuristic:

P1 Long tailed units $\phi$ lead to fewer local maxima and
use fewer degrees of freedom in noise.

P2 Spurious bent planes are more likely near the convex
hull of the X’s.

P3 Spurious sigmoidal units are more likely to pass
through the middle of the X’s.

P4 Spurious radial basis units are more likely when the
radius is small.

P5 Those small radius units are likely to be found near
voids in the X’s.

Since the method works by estimating the expected
number of high local maxima, it also sheds some light on
which types of units are likely to make global optimization
difficult.

Figure 1 shows 216 predictors $X_i \in R^2$ for a synthetic
data set. Consider a Gaussian radial basis function model
$\phi(X,\theta) = \exp(-\|X - \theta\|^2/2\tau^2)$ in (3). With this model
form, $\theta$ is in the same space as the $X_i$. Figure 2 shows
$|\Lambda(\theta)|^{1/2}$ for this model taking $\tau = 0.5$, $A(X) = 1$ and
$\sigma^2 = 1$. We use = 1 in all examples in this section.

The peak of $\lambda(\theta)$ is in the middle of the $X_i$ set. There
is a second peak between the main body of the data and
a small cluster near (3, 1). There are ridges extending
away from the data along lines equidistant from pairs of
points on the convex hull of the $X_i$. For smaller $\tau$ the
function $\lambda(\theta)$ generally increases and the ridges become
very high and sharp. The ridges correspond to $\theta$-regions
in which small changes in one unit can explain either of
two potential outliers, or perhaps both of them, if they
have the same sign. For small $\tau$ large spikes can appear
over the centers of gaps in the point cloud. In these
locations small changes in $\theta$ can make big changes in
what the unit explains.

In order to plot the results for sigmoidal units and
other activations which are functions of projections of
the data, we turn to polar coordinates. For $\theta = (\theta_1, \theta_2)'$,
let

$$\pi(X_i, \theta) = X_{i1}\cos(\theta_1) + X_{i2}\sin(\theta_1) - \theta_2,$$

so that $\theta_1$ is an angle and $\theta_2$ is a radius. Figure 3 shows
$\lambda(\theta)$ for a sigmoidal radial basis unit

$$\phi(X_i, \theta) = (1 + \exp(-\pi(X_i,\theta)/\tau))^{-1}.$$

Here $\tau = 0.5$ and $A(X) = 1$. The points in the plot trace
out the convex hull of the data from Figure 1. That is,
for a list of angles $\theta_1$, the maximum and minimum of
$X_{i1}\cos(\theta_1) + X_{i2}\sin(\theta_1)$ over the $X_i$ is plotted. Figure
3 shows that spurious sigmoids are more likely to have
their linear regions passing through the center of the
data than near the convex hull of the data. Decreasing
$\tau$ makes the sigmoids approach “threshold” units, and
this generally increases $|\Lambda|$. (With threshold units, the
process $Z(\theta)$ is not smooth enough to apply Theorem
2 of Owen (1993a), but the Poisson clumping heuristic
may be applied in another form.)

Figure 4 shows $|\Lambda(\theta)|^{1/2}$ for crease units of the form
$\phi(X_i,\theta) = \pi(X_i,\theta)_+$. Again $A(X) = 1$, but for this
activation, the spurious events are much more likely near
the convex hull of the predictors. Note that taking
$A(X) = (1, X')$ makes models (4) and (3) into a plane
and bent plane respectively.

Figure 5 shows $\lambda(\theta)$ for hyperbolic fold units of the
form $\phi(X_i,\theta) = \pi(X_i,\theta)/2 + (\tau^2 + \pi(X_i,\theta)^2)^{1/2}/2$. For
Figure 5, $\tau = 0.5$. Note that as $\tau$ decreases to zero, the
hyperbolic folds become bent plane creases.
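
The three projection-based activations of this section can be written down compactly, which also makes the limiting relation between folds and creases easy to check numerically. In the sketch below the function names are illustrative, and the hyperbolic fold is coded as $\pi/2 + (\tau^2 + \pi^2)^{1/2}/2$, a form assumed here because it reduces to the crease $\pi_+$ as $\tau$ goes to zero.

```python
import math

def projection(x, theta):
    """pi(x, theta) = x1*cos(theta1) + x2*sin(theta1) - theta2 (angle theta1, radius theta2)."""
    return x[0] * math.cos(theta[0]) + x[1] * math.sin(theta[0]) - theta[1]

def sigmoid(x, theta, tau=0.5):
    """Sigmoidal unit with inverse slope tau."""
    return 1.0 / (1.0 + math.exp(-projection(x, theta) / tau))

def crease(x, theta):
    """Crease (bent plane) unit: positive part of the projection."""
    return max(projection(x, theta), 0.0)

def hyperbolic_fold(x, theta, tau=0.5):
    """Smooth approximation to the crease; approaches it as tau -> 0 (assumed form)."""
    p = projection(x, theta)
    return p / 2.0 + math.sqrt(tau ** 2 + p ** 2) / 2.0
```

For any fixed point, `hyperbolic_fold` with a very small `tau` agrees with `crease` to within rounding, while larger `tau` rounds off the corner of the bent plane.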

For Figure 6 a sigmoidal unit is considered with
$A(X) = (1, \phi(X,\theta_0))$ where $\theta_0 = (\pi/4, 1)$. That is, a
second sigmoidal unit is being trained while the first one
is held with its angle at $\pi/4$ and its radius at 1.0. The
resulting plot of $|\Lambda|^{1/2}$ shows that the second unit being
trained has a tendency to be close to the first one being
held fixed. This suggests that, if units are trained
sequentially, spurious units might arise near units
already included in the model. This behavior arises for
radial basis units and for crease units too. It is somewhat
weaker for the hyperbolic folds.

Owen (1993b) makes some simplifying assumptions
(large n, X spherical Gaussian in d dimensions) and develops
an approximation of the form

PiS  >  j/)  ~  >  y)  (9) 

for units $\phi(X,\theta)$ with fixed radius $\|\theta\|$. In this case the
multiplier $\delta$ depends on the radius and of course on the
type of unit. The main conclusion is that short tailed
units have larger values of $\delta$. For some long tailed units
the resulting $\delta$ is close to one, indicating that, for such
units, redundancy does not make large changes to the
asymptotics. Short tailed sigmoidal units are ones where
the distribution function corresponding to the sigmoid
used has short tails. For example the Cauchy distribution
has very long tails, the uniform distribution function
has very short tails and the widely used logistic sigmoids
have tail lengths between these extremes. Fold units that
approximate creases are defined through the integral of
a sigmoidal function. The fold has short or long tails
according to whether the sigmoid does.

5  Many  Units 

It is possible to extend this method to problems with
many units, though it is harder to find simple descriptions
of the results. Suppose that for $j = 1, \ldots, J$ we
have $\theta_j \in \Theta_j$. Let $\Theta_0$ be the unit hemisphere in J dimensions,
with a positive J'th component. Let $(\theta_{01}, \ldots, \theta_{0J})$
be a point in $\Theta_0$. Then we may write (1) as

$$\mu(x) = \omega_0 + \omega \sum_{j=1}^{J} \theta_{0j}\, \phi(x, \theta_j)$$  (10)

$$\phantom{\mu(x)} = \omega_0 + \omega\, \varphi(X, \vartheta)$$  (11)

where $\vartheta \in \Theta_0 \times \Theta_1 \times \cdots \times \Theta_J$ subsumes all the nonlinear
parameters $\theta_j$ and all but one degree of freedom of $\omega_1$
through $\omega_J$ and $\varphi$ is a nonlinear function of X.

Sun (1989) uses this construction in studying p values
for projection pursuit regression.

Figure  Captions 

Figure 1 Shown are 216 points $X_i \in R^2$. These are
a synthetic data set of predictors.

Figure 2 The points are those of Figure 1. The
contours are those of $|\Lambda|^{1/2}$ for a Gaussian radial basis
function with radius $\tau = 0.5$.

Figure 3 The contours are those of $|\Lambda|^{1/2}$, in polar
coordinates, for a sigmoidal unit with inverse slope $\tau = 0.5$.
The points describe the convex hull of the data set
in Figure 1.

Figure 4 The contours are those of $|\Lambda|^{1/2}$, in polar
coordinates, for a crease (bent-plane) unit. The points
describe the convex hull of the data set in Figure 1.

Figure 5 The contours are those of $|\Lambda|^{1/2}$, in polar
coordinates, for a hyperbolic unit with inverse slope $\tau = 0.5$.
The points describe the convex hull of the data set
in Figure 1.

Figure 6 The contours are those of $|\Lambda|^{1/2}$, in polar
coordinates, for a sigmoidal unit with inverse slope $\tau = 0.5$.
Another sigmoidal unit, with nonlinear parameter
frozen at $\theta = (\pi/4, 1.0)$, is included in the linear portion
of the model. The points describe the convex hull of the
data set in Figure 1.
Abstract

The use of likelihood profiles for exploring and measuring
non-identifiability and near non-identifiability is
discussed. The method is then applied to the estimation
of normal-gamma stochastic frontier models used
in econometrics. It is shown that these models are
practically non-identifiable for sample sizes up to several
hundreds of observations.

Keywords:  Frontier  models 

1.  Introduction 

This paper deals with the following problem frequently
encountered in practice. A standard parametric
model exists for a certain type of data sets, but the
researcher has the impression that the choice of this
model is somewhat arbitrary and that a more flexible
extension might be more appropriate. The natural
move is to add a parameter to increase flexibility and
to estimate this parameter together with the quantities
one is interested in from the data. Unfortunately,
this can easily turn a well-posed problem into a
non-identifiable or nearly non-identifiable one. Likelihood
profiles can be used to explore such situations.

The tool, profiling, is not new and ample literature
exists on various of its aspects. However, except for the
work of Bates and Watts (1988) authors have mostly
concentrated on the properties of profiles in the context
of elimination of nuisance parameters (Barndorff-Nielsen,
1983; Barndorff-Nielsen, 1986) and less on
their value for the purpose of exploration (Ritter and
Bates, 1993).

The paper begins with an introduction of the notation
of the problem and of the terminology of likelihood
profiles. In Section 3, a concrete problem,
the estimation of normal-exponential and normal-gamma
stochastic frontier models (Aigner, Lovell and
Schmidt, 1977; Stevenson, 1980; Meeusen and van den
Broeck, 1977; Greene, 1990) is described. In Section
4, a strategy for using likelihood profiles to study this
problem is laid out. In Section 5, the results of a simulation
are reported. The paper is concluded by a
discussion of the results.


2.  Notations  and  Terminology 

We suppose that data are generated as continuous
random variables from a parametric model as

$$X_i \sim F_\theta(x); \quad \theta \in \Theta \subset R^k,$$  (2.1)

where $\Theta$ is a nice connected domain, and where F
has density f which is twice continuously differentiable
in $\theta$. The corresponding likelihood is denoted
by $L(\theta|x) = f_\theta(x)$ and the log-likelihood by $l(\theta|x)$.

Moreover, we assume that inference is conducted
by maximum likelihood. That is, for a sample
$x = (x_1, \ldots, x_n)$ the point-estimate of $\theta$ is obtained by

$$\hat\theta = \arg\max_\theta L(\theta|x),$$  (2.2)

and confidence regions are computed by either using
the inverse information matrix

$$\hat\Sigma = \left[-D^2_\theta \log L(\theta|x)\big|_{\theta=\hat\theta}\right]^{-1}$$  (2.3)

or the $\chi^2$ approximation of the log-likelihood.

If we are worried that the model might not be sufficiently
flexible, we can try to find an extension by
incorporating an additional parameter $\psi$. We denote
the likelihood after adding $\psi$ as $L(\theta,\psi|x)$. Frequently,
the original model corresponds to a particular choice of
$\psi = \psi_0$ for which $L(\theta,\psi_0|x) \propto L(\theta|x)$. If $L(\theta,\psi|x)$ is
smooth in the joint parameter vector and if $\psi_0$ is in the
interior of the domain of $\psi$, the usual likelihood-ratio
test can be used to check whether the data require the
extended model or not.

Frequently, however, maximum likelihood estimates
are much harder to find for the extended model than
for the original one. The information contained in
common finite samples may not suffice to pin down $\psi$
and the estimates of the components of $\theta$ may strongly
depend on $\psi$. In this situation, virtually all precision
in the estimation of $\theta$ is lost by going from the original
to the extended model. That is, the extended model
becomes practically non-identifiable.

In order to assess how adding $\psi$ to the model affects
the estimation of $\theta$ we can try to compute the profile
trace

$$\hat\theta(\psi) = \arg\max_\theta L(\theta,\psi|x)$$  (2.4)

and the profile value of the log-likelihood

$$l(\psi) = \max_\theta l(\theta,\psi|x) = l(\hat\theta(\psi),\psi|x).$$  (2.5)

If a joint maximum likelihood estimate $(\hat\theta, \hat\psi)$ exists,
we can carry this out by re-maximizing the likelihood
for discrete values of $\psi$ starting at $\hat\psi$ and gradually
moving outward. This assures that good starting values
for the re-maximization are always available.
Alternatively, if no joint maximum can be found but if
the original model is a special case of the extended
model at $\psi_0$, one can start with the estimates of the
original model and move gradually away from $\psi_0$. If
no obvious starting point is available, a grid of $\psi$ values
has to be laid out and the conditional optimizations
have to be attempted directly. Once the profile
trace and the profile values have been computed for a
sufficiently far reaching and fine grid of $\psi$ values,
intermediate values can be obtained by spline interpolation.
The existence of profiles can only be guaranteed under
severe regularity conditions and the reader should
keep in mind that computing profiles is an exploratory
technique which will work in many but not all situations.
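
A profile computation can be illustrated on a deliberately simple stand-in for the extended model: a plain gamma sample, with the shape as the added parameter and the scale as the parameter of the original (exponential) model. This is not the normal-gamma frontier model studied below, and the function name is hypothetical. The example is chosen because, for a fixed shape, the maximizing scale is available in closed form as the sample mean divided by the shape, so no numerical re-maximization or starting values are needed.

```python
import math

def gamma_profile(x, alpha_grid):
    """Profile trace and profile log-likelihood over the shape alpha of a gamma sample.

    Returns (trace, values): for each alpha in alpha_grid, the maximizing scale
    and the log-likelihood attained there.
    """
    n = len(x)
    mean_x = sum(x) / n
    sum_log = sum(math.log(v) for v in x)
    trace, values = [], []
    for alpha in alpha_grid:
        scale = mean_x / alpha  # closed-form conditional maximizer of the scale
        loglik = ((alpha - 1.0) * sum_log
                  - n * mean_x / scale
                  - n * alpha * math.log(scale)
                  - n * math.lgamma(alpha))
        trace.append(scale)
        values.append(loglik)
    return trace, values
```

At a shape of one the trace returns the exponential-model estimate of the scale, so the original model appears as one point on the profile. In models without a closed-form inner maximum, each grid value would instead require a numerical optimization started from the solution at the neighboring grid value.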


In the following discussion, we denote the variance
of the normally distributed noise component $\nu_i$ by $\sigma^2$,
the scale parameter of the exponential or gamma inefficiencies
by $\lambda$, and the shape parameter of the gamma
distribution by $\alpha$. We assume that the $z_i$ and the $\nu_i$
are all independent.

4. Analyzing the Normal-Gamma Model by Likelihood Profiles


3.  A  Stochastic  Frontier  Model 

A typical case where practical non-identifiability is
observed is the transition from a normal-exponential to
a normal-gamma stochastic frontier model for econometric
data. Such frontier models have the structure

$$Y_i = \mu + x_i\beta - z_i + \nu_i,$$  (3.1)

where $Y_i$ represents the observed output (passenger
miles for airlines, for example) and $\mu + x_i\beta$ the optimal
output which can be obtained from the vector
of inputs $x_i = (x_{i,1}, \ldots, x_{i,p})$ (which could be labor,
capital, fuel, etc.). The parameters $\mu$ and $\beta$ are unknown
and have to be estimated from the data. The
two error terms $z_i$ and $\nu_i$ represent the inefficiency of
unit i and the measurement error. The component $z_i$
is restricted to be positive, while $\nu_i$ is usually treated
as normally distributed with an unknown variance $\sigma^2$.
There are several choices for a distribution of the $z_i$.
Good estimation properties can be obtained using an
exponential or a half-normal distribution. The disadvantage
of these choices is that the shape of the distribution
of the inefficiencies is imposed without scientific
reason. One can avoid such a hard choice by using
a gamma distribution for the inefficiencies instead
(Greene, 1990). Gamma distributions are very flexible
and contain the exponential distribution as a special
case when the shape parameter is equal to one.
Unfortunately, maximum likelihood estimation is much
more difficult for the normal-gamma model than for
the normal-exponential model.


Figure 1: Profile values for the normal-gamma model of the American Electric Utilities (Greene, 1990). The evaluated points are joined by an interpolating spline.


The transition from the normal-exponential to the normal-gamma stochastic frontier model is a showcase for the use of likelihood profiles. Suppose that the likelihood of the normal-gamma model has been optimized for fixed values $\alpha_1 < \alpha_2 < \cdots < \alpha_p$ covering a range from distributions more extreme than the exponential (i.e., $\alpha < 1$) to distributions close to normal (i.e., $\alpha \gg 1$) and that the corresponding profile trace and the profile values of the log-likelihood are $\hat\theta_1, \hat\theta_2, \ldots, \hat\theta_p$ and $l_1, l_2, \ldots, l_p$ (here $\theta$ denotes the combined parameter vector $(\mu, \beta', \sigma^2, \lambda)$). Suppose also that the joint maximum likelihood estimate $(\hat\theta, \hat\alpha)$ was found and is among those values. By the approximation of the likelihood ratio statistic we obtain

$$2\left[\, l(\hat\theta, \hat\alpha) - l(\alpha) \,\right] \approx \chi^2_1. \qquad (4.1)$$


This enables us to define likelihood intervals for $\alpha$ with approximate $1 - w$ coverage by

$$\left\{\, \alpha \;\middle|\; l(\alpha) > l(\hat\theta, \hat\alpha) - \tfrac{1}{2}\chi^2_1(1 - w) \,\right\}. \qquad (4.2)$$

In practice, for a coverage probability of 95%, we can plot the profile values $2l_i$ versus the $\alpha_i$ and draw a line $\chi^2_1(.95) \approx 3.84$ below the observed maximum of $2l$. The range of $\alpha$ values corresponding to points above the line provides an approximate confidence interval for $\alpha$ and also a simple graphical means for judging whether $\alpha$ is well-determined by the data. Figure 1 shows such a plot for a normal-gamma model of the efficiencies of American Electrical Utilities analyzed by Christensen and Greene (1976) and Greene (1990).

Figure 2: Medians of $2(l_j(\alpha_i) - l_{0j})$ for each combination of $n$ and $\rho$. The abscissa is on a logarithmic scale and the evaluated points are joined by interpolating splines.

We see that $2l$ considerably exceeds the lower line for all chosen values of $\alpha$. None of those values is therefore rejected by the likelihood ratio test. This indicates that the data (123 records) do not contain sufficient information to tie down $\alpha$. Ritter and Simar (1993) show that the imprecision in the estimation of $\alpha$ carries over to the quantities of econometric interest.
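
The graphical rule described above (keep every grid value whose profile value $2l$ lies within 3.84 of the maximum) can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `likelihood_interval` and its arguments are hypothetical, the profile log-likelihood values are assumed to have been computed already on a grid of $\alpha$ values, and the cutoff 3.84 is the rounded 95% quantile quoted in the text.

```python
def likelihood_interval(alphas, profile_loglik, cutoff=3.84):
    """Return the grid values of alpha retained by the profile likelihood rule.

    alphas         -- grid of fixed shape parameters alpha_1 < ... < alpha_p
    profile_loglik -- profile log-likelihood values l_1, ..., l_p on that grid
    cutoff         -- chi-square(1) quantile (3.84 for 95% coverage, as in the text)
    """
    two_l = [2.0 * l for l in profile_loglik]
    threshold = max(two_l) - cutoff  # the horizontal line drawn below the maximum
    return [a for a, v in zip(alphas, two_l) if v >= threshold]
```

If the returned list covers the whole grid, as happens for the utilities data discussed above, the data do not tie down $\alpha$ within the range explored.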

5.  Simulation  of  Special  Cases 

In this section, we use simulations from a specific but typical normal-gamma model to show how the sample size and the share of the total variance attributed to the noise component $\nu_i$ affect the estimation properties of $\alpha$.

The special case considered here is the normal-gamma model

$$y_i = \mu - z_i + \nu_i \qquad (5.1)$$

with frontier $\mu = 0$ and shape parameter $\alpha = 2$. The choice of the shape parameter corresponds to a distribution which is clearly not exponential, but still far from normal. The parameters characterizing the estimation properties are the sample size $n$ and the ratio $\rho = \sigma^2/(\alpha\lambda^2 + \sigma^2)$, the proportion of “noise” in the total variance. For example, the choice $\rho = 1/3$ implies that 1/3 of the total variability comes from the noise component and 2/3 from the inefficiencies. An allocation of 1/5 to 1/2 of the total variance to the noise component is typical and has for example been observed with the American Electric Utility data.

For any choice of $n$ and $\rho$, data sets can be simulated. These data sets can then be analyzed by maximum likelihood and, in particular, their profile traces and values can be computed with respect to $\alpha$.
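
As a rough illustration of how one such data set might be generated, the sketch below draws $y_i = \mu - z_i + \nu_i$ with gamma inefficiencies and normal noise. Several things here are assumptions rather than details from the study: the function names are hypothetical, the scale $\lambda$ is fixed at 1 (the text does not state the value used), and the noise variance is derived from $\rho = \sigma^2/(\alpha\lambda^2 + \sigma^2)$ by solving for $\sigma^2$.

```python
import random


def noise_variance(rho, alpha, lam):
    """Solve rho = sigma^2 / (alpha * lam^2 + sigma^2) for sigma^2."""
    return rho * alpha * lam ** 2 / (1.0 - rho)


def simulate_normal_gamma(n, rho, alpha=2.0, lam=1.0, mu=0.0, seed=None):
    """Draw n observations y_i = mu - z_i + nu_i.

    z_i  ~ Gamma(shape=alpha, scale=lam)   (inefficiency, positive)
    nu_i ~ Normal(0, sigma^2)              (noise), with sigma^2 implied by rho
    lam = 1 is an assumed value, not one taken from the paper.
    """
    rng = random.Random(seed)
    sigma = noise_variance(rho, alpha, lam) ** 0.5
    return [mu - rng.gammavariate(alpha, lam) + rng.gauss(0.0, sigma)
            for _ in range(n)]
```

For example, `simulate_normal_gamma(400, 1/3, seed=1)` would give one data set for the scenario $n = 400$, $\rho = 1/3$; repeating the call with different seeds gives the replicates.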

Recall that the true parameters are known and thus provide the likelihood $l_0 = l(\theta, \alpha \mid x)$. For fixed $\alpha = 2$, the profile value $l_2 = l(2)$ relates to $l_0$ via the approximation $2(l_2 - l_0) \approx \chi^2_3$, a $\chi^2$ distribution with three degrees of freedom. For each simulated data set, the true likelihood $l_0$ can be computed and used to offset the profile values $l(\alpha_i)$, thus allowing comparisons of the profile values across data sets.

For example, the quantities $2(l_j(\alpha_i) - l_{0j})$ can be computed for the simulated data sets $j$, and summaries such as medians and quartiles can be retained for each position $\alpha_i$. A graphical superposition of the medians for each of the settings above can provide a convenient assessment of the estimation properties of $\alpha$.
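
The summary step can be sketched as follows, assuming the profile values and the true log-likelihoods have already been computed for every simulated data set. The function name and the nested-list layout (`profiles[j][i]` holding $l_j(\alpha_i)$) are illustrative choices, not part of the original study.

```python
from statistics import median


def offset_profile_medians(profiles, true_loglik):
    """Median over data sets j of 2 * (l_j(alpha_i) - l_0j), one value per alpha_i.

    profiles    -- profiles[j][i] is the profile log-likelihood of data set j at alpha_i
    true_loglik -- true_loglik[j] is the log-likelihood of data set j at the true parameters
    """
    n_grid = len(profiles[0])
    return [
        median(2.0 * (prof[i] - l0) for prof, l0 in zip(profiles, true_loglik))
        for i in range(n_grid)
    ]
```

Plotting the returned medians against the $\alpha_i$ for each scenario gives curves of the kind shown in Figure 2.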

6.  Results  of  The  Simulation  Study 


n      Scenarios marked (x) among the columns $\rho = \sigma^2/(\sigma^2 + \alpha\lambda^2)$ = 1/9, 1/5, 1/3, 1/2
100    x x
200    x
400    x x x
800    x

Table 1: Scenarios for simulation.

For each scenario indicated in Table 1, 60 data sets were simulated, and for each of them the profile of the log-likelihood with respect to $\alpha$ was computed. Figure 2 shows the medians of $2(l_j(\alpha_i) - l_{0j})$ for each scenario. The medians for the same scenario are joined by smooth curves. The solid line represents the expected value of the median of a $\chi^2$ distribution with three degrees of freedom; the dashed line, $\chi^2_1(0.95) = 3.84$ units below, denotes the cutoff corresponding to a 95% confidence likelihood region. As we expect, the observed medians for $\alpha = 2$ are close to the theoretical median of the $\chi^2_3$ distribution. Moreover, all points except for $\alpha = 0.5$ of the cases (800, 1/3), (400, 1/9), and (400, 1/5) lie above the dashed line. This suggests that the estimation of $\alpha$ is very poor when the sample size is small and when there is a considerable amount of noise.

7.  Discussion 

Profiles  can  provide  convenient  tools  for  explor¬ 
ing  likelihoods  in  situations  of  near  non-identifiability. 
The  results  can  be  displayed  using  simple  graphics  and 
are  easy  to  interpret.  In  the  context  of  normal-gamma 
stochastic  frontier  models,  this  approach  yielded  the 
insight  that  in  general  large  sample  sizes  are  needed 
to  estimate  a  well.  Sample  sizes  of  100  or  200  ob¬ 
servations,  which  are  common  in  practice,  are  clearly 
insufficient. 

Gradually, profiling algorithms are finding their way into standard statistical software packages. Explicit profiling algorithms are already available in S and S-Plus. In other packages, profiling is an implicit ingredient in procedures for Bayesian inference. This is the case in Xlispstat, where the procedure for computing the Laplacian approximation of a marginal relies on the computation of a profile.
ABSTRACT

The  evaluation  of  statistical  procedures  in  the 
area  of  finance  requires  powerful  and  rich  com¬ 
puter  environments.  Requirements  for  such 
environments  are  stated  and  their  need  illus¬ 
trated  with  the  example  of  geometric  Brown¬ 
ian  motion. 

1  Introduction 

Computer intensive methods and the interface between statistics and computing seem to carry nowadays a specific and rather restrictive meaning, that of single statistical techniques or methods which rely heavily on the computer for implementation. Thus LMS (Least Median of Squares) regression [13] requires that many systems of linear equations be solved. Many problems in statistics have their source outside of it and require that broad arrays of mathematical, statistical and numerical techniques be brought to bear on sizeable areas of a particular discipline or set of such. The discipline considered here for illustration is that area of finance which deals with contingent claims [4] and, in it, the simplest model, geometric Brownian motion (GBM henceforth), shall be chosen.

^Research  support  is  provided  by  the  Swiss  Na¬ 
tional  Science  Foundation,  Contrat  No.  12-36209.92 


The computer intensive aspect of this problem area is due to two basic factors. The first is the high number of mathematical, statistical and programming techniques that must be marshalled to progress towards a solution. The second is the limitation inherent in all analytical developments when numerical answers are required: one must, in fine, resort to simulations. In such situations, significant “practical” progress towards workable solutions is often dependent on the quality of the “information system” which is available and which always must encompass much more than a set of statistical tools, even when packaged into an organic whole. Such considerations justify the second part of the title which is borrowed from the CASE (Computer Aided Systems Engineering) technology discourse: it sees solving a problem (building an information system) as a set of tasks which are grouped into activities which constitute processes. These yield in turn the solution (the information system). In that world tasks require tools, activities require workbenches, and processes require environments [6]. It is claimed that there is a need for analogous concerns and means in the area of computer intensive methods of interest here and a set of “minimal” requirements is given that would provide an adequate environment for the pursuit of such problems. Similar needs arise in certain areas of engineering [1]: the difference is mostly with the type of mathematical models that are used.


2  An  example:  Statistical 
fit  of  GBM 


GBM is of interest in the financial area because it is intimately linked to the Black-Scholes formula, a formula that allows pricing of an option [4]. To actually use the formula one must obtain an estimate of a parameter which is the diffusion parameter of the GBM which describes the behaviour of the asset supporting the option. A GBM $S_t$ is described [10] implicitly by the stochastic differential equation

$$dS_t = \mu S_t\,dt + \sigma S_t\,dW_t,$$

where $W_t$ is a standard Wiener process, and explicitly by the expression

$$S_t = S_0 \exp\left\{\left(\mu - \tfrac{1}{2}\sigma^2\right)t + \sigma W_t\right\}.$$

The statistical problem consists in estimating $\mu$ and $\sigma$ from the observation of a path $S(t)$ of the process $S$ at a finite number of time points

$$t_0 = 0 < t_1 < \cdots < t_n = T.$$
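
To make the sampling setting concrete, the following sketch simulates a GBM path at a given set of observation times. It relies on the standard explicit solution of the GBM equation, under which the log-increments over disjoint intervals are independent normal variables with mean $(\mu - \sigma^2/2)\,\Delta t$ and variance $\sigma^2 \Delta t$, so the path is exact on the grid. The function name and arguments are hypothetical and chosen only for this illustration.

```python
import math
import random


def simulate_gbm(s0, mu, sigma, times, seed=None):
    """Sample a geometric Brownian motion at the increasing time points `times`.

    times[0] is the starting time, at which the path equals s0.
    Each step multiplies the previous value by exp(normal log-increment),
    so simulated values stay strictly positive.
    """
    rng = random.Random(seed)
    path = [s0]
    for t_prev, t_next in zip(times, times[1:]):
        dt = t_next - t_prev
        log_increment = ((mu - 0.5 * sigma ** 2) * dt
                         + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0))
        path.append(path[-1] * math.exp(log_increment))
    return path
```

Irregular observation times are handled naturally, since each step uses its own $\Delta t$; recording the seed alongside the path makes a simulation reproducible.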


A number of estimators are available [2, 3, 7, 12], but there seems to be little comparative work in settings which are “realistic” (as described in [4] for example), which means in particular that the number of observations is small (between 50 and 200 is “typical”), and often that $T = n$. These constraints raise a number of questions for which there are few analytical answers (it should be stressed that the case of GBM is almost the simplest one could conceive). The recourse is thus simulations. Most methods known so far, and in particular those mentioned here, are, at best, supported by partial simulations which are usually limited to the method presented, and which avoid comparisons with other methods. No systematic statistical investigations exist, which is easily understood, given the complexity of the estimators considered.

One possible method of estimation of $\mu$ and $\sigma$ consists in computing the Radon-Nikodym derivative of the law of $S$ with respect to that of $S_0 + \sigma W$, and of deriving from it estimators which are then “discretized at the observations” [2]. One gets

'2 _ - Tki i

These estimators are, in general, sums of independent, non-identically distributed random variables whose law is not exactly expressible analytically. So typically, one must compute moments and derive an asymptotic result. Such calculations require that high order moments (order six in this case) be evaluated. To that end one introduces expressions of the form

^^kNi


where 

5(0  = 

and  Ni  is  a  normal  random  variable  with  pa¬ 
rameters  fii  =  {/J>  —  ^)(ti  —  ti^i)  and  af  = 

2 

=  //,•  +  ^).  A  typical  expres¬ 
sion  is  then 

n^QvJ  =  ^[(54,6-54,2)- 

4  (53, 3  -S3, 0-1-4  (52.1-52,0)] 

Here is a list of what a systematic simulation should yield to allow evaluation of such estimators. First, one should distinguish the case of an exact model and that of a model which is approximate. For an exact model, at least the following questions should be answered:

•  Does  the  value  of  the  parameter  to  be  es¬ 
timated  influence  the  quality  of  the  esti¬ 
mator? 

It  would  indeed  not  be  surprising  if  very 
small  or  very  large  values  of  the  parame¬ 
ters  to  be  estimated  would  influence,  pos¬ 
itively  or  negatively,  the  quality  of  some 
of  the  estimators  to  be  considered. 

•  What is the influence of the number of observations on the quality of the estimators?

What is meant by “number of observations” can be many-sided: it may be the
absolute  number  of  observations,  but  it 
also  may  be  the  density  of  observations 
(absolute  number  over  time  observed,  or 
number  per  unit  of  time).  In  [12]  it  means 
four  strongly  typed  observations  per  day: 
the  question  then  becomes,  how  many 
days? 

The  question  may  also  depend  on  the  type 
of  statistical  result  expected:  the  number 
of  observations  required  to  obtain  a  good 
estimator  may  be  less  than  that  necessary 
to  a  validation  of  the  fit.  If  one’s  only 
recourse  is  a  central  limit  result,  when 
(in  terms  of  absolute  numbers  or  density) 
does  this  limit  effect  take  place? 

One may finally ask for “optimal” combinations to ensure “overall quality”, such as absolute number together with a given duration.

•  Does  the  regularity  of  observations  mat¬ 
ter? 

Does one need observations taken at regular times, or are observations registered when possible sufficient? In the latter case, is there a “minimum time interval” beyond which estimators become useless?

•  Are  there  better  methods  of  estimation? 

In other words, can one produce prescriptions for estimation which ensure “quality” of the results?

•  What  is  the  law  of  the  price  of  the  option? 
Is  it  sensitive  to  the  estimation  procedure, 
or  to  any  of  the  potentially  disrupting  fac¬ 
tors? 

It  should  be  clear  that  one  would  need  in 
practice  some  kind  of  confidence  interval 
for  the  price! 

In  case  of  a  process  which  does  not  behave 
according  to  the  model,  a  number  of  obvious 
questions  come  to  mind.  Here  are  a  few: 

•  Are  the  estimators  robust? 

One could ask for the kind of robustness which is expected: the really important one would seem to be that of the law of the price! An associated question would be: are the validation procedures sufficient to at least alert the user to a “departure” from the model, such as a process with sample paths which could be produced by geometric Brownian motion, but which, in reality, are not?

•  Are  there  procedures  which  could  be  used 
to  detect,  or  to  adapt  the  statistical  proce¬ 
dures  to,  a  change  in  the  model? 

The  simplest  case  would  be,  for  geometric 
Brownian  motion,  a  change  in  the  values 
of  the  drift  and  the  diffusion  parameters. 
3  A wish list of components for an adequate environment

Evaluation  of  statistical  methods  in  finance 
should  be  performed  as  a  two  stage  procedure: 
during  the  first,  one  would  only  be  concerned 
with  the  purely  statistical  performance  of  the 
method,  that  is  one  would  want  to  make  sure 
the  method  is  statistically  sound.  During  the 
second  stage,  one  would  want  to  check  that 
the  method  works  well  for  the  financial  ana¬ 
lyst  (not  the  statistician).  The  latter  requires 
that  one  has  access  to  databases  with  financial 
information,  and  that  a  prerequired  set  of  sta¬ 
tistical  operations  be  performed.  A  flexible  en¬ 
vironment  would  accept  commands  which  list 
the  operations  and  the  data,  and  carry  out 
the retrievals and the computations. This is purely a technical matter for a computer expert. Only the first stage is of interest here.

In the chosen example, there are many ways to estimate the parameters and a number of “dimensions” according to which the evaluation
of  these  estimates  should  be  carried  out.  The 
dimensions  correspond  to  the  questions  raised 
in  section  2.  Practically  one  carries  out  the 
simulation  as  follows. 

One begins with simulations of the process (GBM here). To that end one must have at least two tools: an “augmented” random “objects” generator and tools to manage the results of the simulations. Traditional random “objects” generators simulate “objects” whose complexity is that of a random variable (random number generators). For finance one must be able to simulate well at least paths of diffusions with state spaces strictly smaller than the real line (assets typically do not have negative values). As shown with the expert systems ADAGIO and PRESTO [^] (PRESTO is an expert system which performs automatic generation of complete Fortran programs solving Stochastic Differential Systems, from data provided by a user supposed to have no prerequisite knowledge either in Numerical Analysis of these systems or in programming), a useful generator must be coupled with an “AI language” (Lisp in PRESTO) and a symbolic manipulator (REDUCE in PRESTO). In fact,
it  would  be  extremely  useful  to  have,  among 
the  capacities  provided  by  the  symbolic  ma¬ 
nipulator,  facilities  which  automate  stochastic 
calculus,  in  the  spirit  of  [8].  Furthermore,  the 
simulator  should  come  with  “automatic”  tools 
to  check  the  quality  of  the  paths  (if  an  estimate 
is  computed  on  a  path,  one  must  make  sure 
that  what  is  observed  is  the  behavior  of  the 
estimator,  and  not  the  behaviour  of  the  simu¬ 
lated  path).  “Exhaustive”  simulation  of  paths 
of  stochastic  processes  requires  on  the  other 
hand  that  one  benefits  from  facilities  to  man¬ 
age  the  versions,  such  as  “semi-automatic”  la¬ 
beling  of  files,  recording  of  seeds,  and  so  forth. 
One  should  then  be  able  to  browse  “easily” 
through  these  simulations. 

Once  the  paths  are  available,  one  needs  a 
“sampler”  for  at  least  two  purposes.  It  has 
been  argued  that  time  is  an  important  ele¬ 
ment  for  the  statistics  of  financial  models.  One 
should  thus  be  able  to  test  the  potential  es¬ 
timates  against  the  possible  time  dimensions 
as  described  above.  But  also  some  estimation 
procedures  may  require  specific  time  sampling. 
For  example,  the  estimation  procedure  inves¬ 
tigated  in  [12]  requires  the  first  and  last  daily 
values  of  the  asset,  as  well  as  the  largest  and 
the smallest during the day. Thus, extracting different types of samples and associated characteristics from a simulated path should be an easy operation. Finally, since financial data is “historical” data, the basic assessment technique will eventually be the bootstrap [5] or an adaptation of it (it is thus necessary to produce, from a sampled path, the law of $\hat{\sigma}$ so that the law of the price may be exhibited).
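
A “sampler” of the kind described in this paragraph might look like the sketch below, which reduces a finely observed path to the first, largest, smallest and last value of each day. The function name is hypothetical, and two assumptions are made for the illustration: time is measured in days, so that the integer part of a time stamp identifies the day, and observations are supplied in increasing time order.

```python
def daily_summary(times, values):
    """Group a path by day and return (first, high, low, last) for each day.

    times  -- observation times in days, in increasing order (assumed)
    values -- path values observed at those times
    """
    by_day = {}
    for t, v in zip(times, values):
        by_day.setdefault(int(t), []).append(v)
    return {day: (vals[0], max(vals), min(vals), vals[-1])
            for day, vals in sorted(by_day.items())}
```

Other sampling schemes (every k-th observation, irregular subsets, fixed density per unit of time) can be written in the same style, which is what makes comparisons along the time “dimensions” inexpensive once the paths are stored.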

In  the  area  of  diffusions  many  complex  objects, 
such  as  stochastic  integrals,  require  numeri¬ 
cal  approximations  and  it  would  be  useful  to 
have  those  pre-programmed  with  quality  algo¬ 
rithms  as  it  seems  clear  that  numerical  quality 
is  essential  for  the  successful  implementation 
of  these  rather  complicated  procedures.  Fur¬ 
thermore  certain  estimation  techniques  such 
as  filtering  [11]  ultimately  require  that  numer¬ 
ical  schemes  for  ordinary  differential  equations 
be  used. 

Of  course  a  large  array  of  “ordinary”  statisti¬ 
cal  techniques  should  be  available  (for  density 
estimation,  for  example).  These,  as  hinted  in 
section  2,  also  require  a  symbolic  calculator  to 
calculate  moments  explicitly  (see  the  formula 
for  the  variance),  and  other  similar  calcula¬ 
tions.  The  user  of  the  system  should  have  fa¬ 
cilities  to  enrich  and  complete  it  with  his  or  her 
favourite  techniques  (access  to  programming 
languages  and  expert  systems  shells).  Reports 
should  be  easy  to  produce  (integration  of  facil¬ 
ities  for  “intelligent”  graphic  presentations). 

At  the  present  time,  tools  are  available.  One 
needs  workbenches  and  environments! 
Abstract:  The distribution of the number of successes in independent Bernoulli trials
is investigated in the case where the probability of success
is  different  at  each  trial.  Expressions  for  the  factorial 
moments  and  cumulants  are  given.  These  expressions  are 
used  to  construct  the  closed  form  of  the  probability  mass 
function.  The  method  is  shown  to  be  ill-conditioned  and 
Tikhonov  regularization  is  used  to  compute  the  probabilities. 
Formulas  for  the  cumulants  and  moments  are  also  developed 
and  the  probability  function  approximated  by  an  expansion 
in  the  orthogonal  polynomials  associated  with  a  Binomial 
distribution. 

1.  Introduction:  In  this  report,  three  methods  for  the 
computation  of  the  probability  mass  function  of  the  random 
variable  X,  which  counts  the  number  of  successes  in  N 
independent trials, will be considered. In section 2, a direct
approach based on exhaustive enumeration will be
considered. It will be shown to be impractical in all but the
simplest  cases.  In  section  3,  formulas  for  the  factorial 
moments  are  given.  These  are  used  with  the  formula  of 
Laurent  [4]  to  give  a  closed  form  representation  of  the 
probability  mass  function.  It  is  shown  that  this  approach  is 
very  ill-conditioned,  but  that  good  results  can  be  obtained  by 
Tikhonov  regularization.  In  section  4,  an  alternative 
approach  based  on  using  moments  to  approximate  the 
probability  mass  function  by  an  expansion  in  orthogonal 
polynomials  is  presented.  It  is  found  that  this  approach 
becomes  ill-conditioned  as  higher  moments  are  used,  but  that 
it  gives  good  results  in  general.  Finally,  in  section  5,  a  table 
of  results  is  given  for  a  number  of  tests  of  the  methods  and 
some  comments  are  made.  In  what  follows,  the  binomial 
coefficients  will  be  denoted  by  C(n,k),  vectors  by  small 
letters  underlined  and  matrices  by  capital  letters.  Small 
letters  with  subscripts  will  denote  the  elements  of  vectors 
and  matrices  where  appropriate. 

2.  The Direct Method: In this section a formal solution to the problem is presented and analyzed as an approach to computing the probability mass function. Let $p_j$ be the probability of a success on the j-th trial and let k be the number of successes in N trials. Consider the set of vectors $\underline{z} \in \mathbb{R}^N$ such that (i) $z_j \in \{0,1\}$ for j=1,2,...,N and (ii)

$$\sum_{j=1}^{N} z_j = k.$$

Then the probability, Pr(X=k), that the random variable X equals k is given by

$$\Pr(X = k) = \sum_{\underline{z}} \prod_{j=1}^{N} \left[ p_j^{z_j} (1 - p_j)^{1 - z_j} \right],$$

where the sum runs over all vectors $\underline{z}$ in this set.

Each value of the probability mass function requires the summation of C(N,k) products, each of which can be expressed as N-1 multiplications. In addition, the computation requires C(N,k)-1 additions. Thus the total number of floating point calculations is (N-1)C(N,k)+C(N,k)-1 for any value of k. Summing over k yields an operations count of $N2^N - (N+1) = O(N2^N)$, so that although simple fast algorithms exist to generate the set of all combinations, the exponential complexity class of the algorithm makes this unfeasible except at the tails of the distribution and for small N. In the evaluation of the methods developed in sections 3 and 4 we will use this calculation procedure to estimate the probabilities for comparison purposes.
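
The direct method can be written down almost literally, which also makes its cost visible: the loop below visits all C(N,k) ways of placing k successes among N trials. This is a minimal sketch for small N only, and the function name `pmf_direct` is an illustrative choice.

```python
from itertools import combinations
from math import prod


def pmf_direct(p, k):
    """Pr(X = k) for independent trials with success probabilities p[0..N-1].

    Sums, over every subset of k trials, the probability that exactly
    those trials succeed and all the others fail.
    """
    n = len(p)
    total = 0.0
    for successes in combinations(range(n), k):
        chosen = set(successes)
        total += prod(p[j] if j in chosen else 1.0 - p[j] for j in range(n))
    return total
```

For k near 0 or N the number of subsets is small, which is why the tail values remain cheap to compute even when N is fairly large.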

3.  The Probability Mass Function in Terms of The Factorial Moments: Noting that the factorial moments of a discrete random variable X, $X \in \{0, 1, 2, \ldots, N\}$, with probability mass function f(x), are defined by the equation,

$$\mu_{[r]} = \sum_{x=0}^{N} x(x-1)\cdots(x-r+1)\, f(x),$$

it was shown by Laurent [4] that f(x) has the equation,

$$f(x) = \frac{1}{x!} \sum_{j=x}^{N} (-1)^{j-x}\, \frac{\mu_{[j]}}{(j-x)!}.$$

The derivation of this result is simple and instructive. If the defining equation for the r-th factorial moment is divided by r!, and r is varied from 0 to N, the resulting system of N+1 linear equations for f(x) is upper triangular with (i,j)-th element equaling C(j-1,i-1) when j≥i and 0 when j<i (note that i,j=1,2,...,N+1). It is easily seen that the columns of this matrix are just the rows of Pascal’s triangle. The elements of the inverse matrix are just $(-1)^{i+j}$ times the elements of this matrix and so Laurent’s formula follows immediately. Examination of this formula reveals potential problems in the computations. In particular, the coefficients of the quantities $\mu_{[r]}/r!$ grow rapidly with N and alternate in sign. In order for the resulting sum to be small, cancellations must occur and so it is unlikely that the function can be calculated with good relative precision. The fact that the coefficient matrix has positive elements and is upper triangular suggests solving for the values of f(x) by back substitution. Unfortunately, the matrix is very ill-conditioned with condition number $\kappa_1 = 2^{2N}$. Thus assuming that the quantities $\mu_{[r]}/r!$ can be found, they will be subject to rounding error and we will be considering a classic discrete ill-posed problem. We shall see that this problem can be successfully solved by application of Tikhonov regularization.

If we define the factorial cumulants in a manner analogous to the usual cumulants and denote the r-th such quantity by $\kappa_{[r]}$, it can be shown that for the distribution of interest,

$$\kappa_{[r]} = (-1)^{r-1}\,(r-1)!\,\sum_{j=1}^{N} p_j^{\,r}.$$

Next let $w_r = \mu_{[r]}/r!$ and $v_r = \kappa_{[r]}/r!$ so the $w_r$ can be generated from the $v_r$ by the following recursion,

$$(r+1)\,w_{r+1} = (r+1)\,v_{r+1} + \sum_{j=0}^{r-1} (j+1)\,v_{j+1}\,w_{r-j},$$

for r > 0. Combining these two equations yields,

$$(r+1)\,w_{r+1} = \sum_{j=0}^{r} (-1)^{j} \left( \sum_{i=1}^{N} p_i^{\,j+1} \right) w_{r-j}.$$
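
As an illustration of this section, the sketch below generates the scaled factorial moments $w_r = \mu_{[r]}/r!$ from the power sums $\sum_i p_i^{\,j+1}$ via the combined recursion $(r+1)\,w_{r+1} = \sum_{j=0}^{r} (-1)^j \big(\sum_i p_i^{\,j+1}\big)\, w_{r-j}$ with $w_0 = 1$, and then inverts them with the alternating-sign formula $f(x) = \sum_{j \ge x} (-1)^{j-x}\, C(j,x)\, w_j$, which is Laurent's formula rewritten in terms of the $w_j$. The function names are hypothetical. No regularization is applied, so for larger N the cancellation problems described above are expected to appear.

```python
from math import comb


def scaled_factorial_moments(p):
    """Return [w_0, ..., w_N] with w_r = mu_[r] / r!, using the recursion on power sums."""
    n = len(p)
    power_sums = [sum(pj ** r for pj in p) for r in range(n + 1)]  # power_sums[r] = sum_j p_j^r
    w = [1.0]  # w_0 = mu_[0] / 0! = 1
    for r in range(n):
        total = sum((-1) ** j * power_sums[j + 1] * w[r - j] for j in range(r + 1))
        w.append(total / (r + 1))
    return w


def pmf_from_factorial_moments(p):
    """Unregularized inversion: f(x) = sum_{j >= x} (-1)^(j - x) * C(j, x) * w_j."""
    w = scaled_factorial_moments(p)
    n = len(p)
    return [sum((-1) ** (j - x) * comb(j, x) * w[j] for j in range(x, n + 1))
            for x in range(n + 1)]
```

For a handful of trials the result agrees with exhaustive enumeration; comparing the two for growing N is a simple way to observe the loss of relative precision.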


As indicated above, the system of equations for the probability mass function, given the factorial moments, becomes increasingly ill-conditioned as N increases. For this reason we apply the method of Tikhonov regularization and restate the problem as a constrained least squares problem:

$$\min_{\underline{f}} \; \| A\underline{f} - \underline{\mu} \|_2^2 + \lambda^2 \| \underline{f} \|_2^2$$

subject to the constraints,

$$\forall j\,(0 \le j \le N),\; f_j \ge 0; \qquad \sum_{j=0}^{N} f_j = 1.$$

The matrix A in these equations is the original upper triangular matrix for the system with the first row and column deleted. The idea of Tikhonov regularization is to choose a suitable value of $\lambda$ by some criterion. A number of ways of choosing this parameter are described in Hansen [2]. We have considered one of those and also one of our own which is particular to this problem. For reference purposes, these will be denoted by

(1). The Generalized Cross Validation Method (GCV) of Golub, Heath and Wahba [1].

(2). MSRE in which the Mean Square Relative Error is calculated by comparing the factorial moment solution to a few "true" values calculated by the direct method at each end of the solution vector.

It should be noted in this context that even when N is fairly large, the first few values of the probability mass function in each tail of the distribution are easily calculated. In either case, a 1-dimensional nonlinear optimization problem for $\lambda$ is solved which requires repeated solution of the following constrained linear least squares problem:

Let $\underline{\mu}$ be the N-vector with components $\mu_{[1]}/1!, \ldots, \mu_{[N]}/N!$ and $I_N$ be the N × N identity matrix, then for each $\lambda$ we solve,

$$\min_{\underline{f}} \left\| \begin{pmatrix} A \\ \lambda I_N \end{pmatrix} \underline{f} - \begin{pmatrix} \underline{\mu} \\ \underline{0} \end{pmatrix} \right\|_2$$

subject to the constraints

$$f_j \ge 0, \quad j = 1, \ldots, N; \qquad \sum_{j=1}^{N} f_j = 1 - \prod_{j=1}^{N} (1 - p_j),$$

where $p_j$ is the known probability of a success on the j-th trial.

The results of computational experiments with this approach are given in section 5. It will be shown there that both of the methods indicated above for choosing the parameter $\lambda$ give satisfactory results. However, the GCV method, which requires the computation of the Singular Value Decomposition of the matrix A, appears to give slightly poorer results.
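
The effect of the regularization parameter can be explored with a deliberately simplified sketch. It is not the procedure of this section: the nonnegativity and sum constraints are omitted, the full (N+1) × (N+1) Pascal-type system is used rather than the matrix with the first row and column deleted, and the penalized problem $\min \|A f - w\|^2 + \lambda^2 \|f\|^2$ is solved through the normal equations, which is convenient in plain Python but squares the condition number. All function names are hypothetical.

```python
from math import comb


def pascal_system(n):
    """Upper triangular matrix with entry C(x, r) in row r, column x (r, x = 0..n)."""
    return [[comb(x, r) for x in range(n + 1)] for r in range(n + 1)]


def solve_linear(m, b):
    """Solve m y = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    a = [[float(v) for v in row] + [float(b[i])] for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n + 1):
                a[r][c] -= factor * a[col][c]
    y = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(a[r][c] * y[c] for c in range(r + 1, n))
        y[r] = (a[r][n] - tail) / a[r][r]
    return y


def tikhonov_solve(a, w, lam):
    """Unconstrained Tikhonov solution of (A^T A + lam^2 I) f = A^T w."""
    rows, cols = len(a), len(a[0])
    ata = [[sum(a[k][i] * a[k][j] for k in range(rows)) for j in range(cols)]
           for i in range(cols)]
    for i in range(cols):
        ata[i][i] += lam * lam
    atw = [sum(a[k][i] * w[k] for k in range(rows)) for i in range(cols)]
    return solve_linear(ata, atw)
```

With `lam = 0` and a small, exactly known right-hand side the probabilities are recovered; increasing `lam` shrinks the solution and trades fidelity for stability, which is the trade-off that a criterion such as GCV or MSRE is meant to settle.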

4.  Approximation  By  Expansion  In  Orthogonal 
Polynomials:  The  technique  to  be  used  here  is  applicable 
to  any  discrete  distribution  and  will  be  described  in  very 
general  terms. 


Let f(x) be the discrete distribution to be
approximated and let f(x) be defined on the set
$\Omega = \{x_0, x_1, \ldots, x_m\}$. Let p(x) be a second known discrete
distribution with domain $\Omega$. Finally let $\{h_j(x), j = 0, 1, \ldots\}$ be a
set of polynomials orthogonal to each other with respect to
p(x) on $\Omega$; that is, such that

\[ \sum_{x = x_0}^{x_m} p(x)\, h_i(x)\, h_j(x) = 0, \qquad i \ne j. \]


By matching moments we shall find coefficients $a_0, a_1, a_2, \ldots$
such that

\[ f(x) \approx p(x)\,[a_0 + a_1 h_1(x) + a_2 h_2(x) + \cdots]. \]

If an approximation utilizing the first r terms of this
expansion is to be generated, then the following result can
be easily derived.

THEOREM: Let $X \in R^{(m+1)\times(r+1)}$ and $D \in R^{(m+1)\times(m+1)}$ be
defined as

\[ X = \begin{pmatrix}
1 & x_0 & x_0^2 & \cdots & x_0^r \\
1 & x_1 & x_1^2 & \cdots & x_1^r \\
1 & x_2 & x_2^2 & \cdots & x_2^r \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_m & x_m^2 & \cdots & x_m^r
\end{pmatrix} \]

and $D = \mathrm{diag}(p(x_0), p(x_1), \ldots, p(x_m))$. Let $A = D^{1/2}X$ have QR
factorization,

\[ A = Q R_{11}, \qquad R_{11} \in R^{(r+1)\times(r+1)}. \]

Then the columns of the matrix $B = D^{-1/2}Q$ are orthogonal
with respect to the weight matrix D and are the orthogonal
polynomials $h_0(x), h_1(x), \ldots, h_r(x)$ evaluated on $\Omega$.
Furthermore, if M is the vector composed of the 0-th moment
and the first r moments of f(x) then the coefficients $a_i$ in the
expansion are the solutions to the system of equations
$R_{11}^{T} a = M$.

Again formulas for the moments and cumulants of the
distribution under study are easily calculated as functions of
the known probabilities of success on individual trials. To
this end, for each $p_j$, let $d_{n+1}(j)$ be defined by

\[ d_1(j) = p_j, \qquad d_{n+1}(j) = p_j (1 - p_j)\, \frac{\partial d_n(j)}{\partial p_j}, \]

then the cumulants $\kappa_r$ are given by

\[ \kappa_r = \sum_{j=1}^{N} d_r(j). \]

The moments are then found from the cumulants by the well
known formula.

The matrix $R_{11}$ tends to become ill-conditioned as
r increases because the matrix X becomes ill-conditioned.
The degree of ill-conditioning is a function of the

Table I.

Comparison  of  the  directly  computed  CDF  to  those  obtained  by  the  GCV,  MSRE  and  series  approximation  method  for  the 
case  of  N=20  and  100  simulations.  The  table  entries  are  relative  differences. 


GCV  (N=20)  100  trials 

lower  tail  upper  tail 

1%  5%  10%  1%  5%  10% 

Max  5.5  X  IQ-^  2.2  x  lO'^  1.3  x  lO'^  3.5  x  lO'^  3.9  x  10  '^  1.3  x  lO'^ 

Q3  3.1  xia®  2.9  xia"  1.4x10-"  8.5  x  lO'^  -4.5x10'’  1.7  x  10'^ 

Med  5.4  xig^  1.0  x  Id"  3.0x10'"  2.3x10-^  -5.1x10  *  3.5x10  '* 

Q1  -4.9x10-^  -5.3x10-^  -1.4x10'^  -3.1x10’  -1.6x10-^  -3.6x10-^ 

Min  -6.0  X  10-’  -8.9  x  10  «  -8.5  x  10  ®  -6.2  x  10  *  -7.3  x  10''  -5.9  x  lO  ’ 


MSRE  (N=20)  100  trials  4  end  points

        lower tail                                 upper tail
        1%            5%            10%            1%            5%            10%

Max     1.1 x 10-’    2.1 x 10-’    3.4 x 10 *     1.4 x 10 ’    8.9 x 10-'“   4.4 x 10’

Q3      1.4 x 10*     5.9 x 10’     4.9 x 10 ’     1.6 x 10 ‘“   5.3 x 10"     4.4 x 10'“

Med     1.1 x 10’     1.5 x 10 '“   4.9 x 10 '“    2.9 x 10 “    -1.6 x 10 '“  5.4 x 10"

Q1      -6.6 x 10’    -3.6 x 10’    -1.5 x 10 ’    -4.7 x 10 “   -3.6 x 10-'“  -2.7 x 10 '“

Min     -5.7 x 10-’   -4.7 x 10 *   -7.6 x 10 *    -6.3 x 10 '“  -2.9 x 10-’   -2.9 x 10-’


SERIES  (N=20)  100  trials

        lower tail                                 upper tail
        1%            5%            10%            1%            5%            10%

Max     4.1 x 10"     5.4 x 10"*    1.9 x 10 *     6.3 x 10“     3.4 x 10“     3.8 x 10“

Q3      2.1 x 10-*    -2.0 x 10"'   1.0 x 10"*     -1.5 x 10“    7.1 x 10*     6.3 x 10 *

Med     3.7 x 10’     -7.9 x 10*    -1.5 x 10“     1.6 x 10 *    3.0 x 10 *    2.0 x 10*

Q1      -1.3 x 10 *   -2.1 x 10*    -5.9 x 10“     -1.6 x 10 ®   1.3 x 10 *    -6.5 x 10’

Min     -2.6 x 10*    -2.8 x 10*    -6.3 x 10’*    -1.4 x 10“    -1.3 x 10“    -4.4 x 10“

distribution p(x). The degree of ill-conditioning can be
controlled by the choice of r. It is expected that the quality
of the approximation will improve as the number of
moments r increases. In fact, if m is finite, then using m
moments will give an exact result. On the other hand, as the
number of moments increases, so does the condition number
of $R_{11}$ and so a reasonable balance between accuracy and
conditioning must be found. Since $R_{11}$ is upper triangular,
the condition number is easily calculated to help in this
decision.
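
The construction in this section can be sketched in a few lines of code. The following is a minimal illustration rather than the authors' FORTRAN implementation: all function names are hypothetical, a weighted modified Gram-Schmidt process is used in place of a library QR routine (it yields the same factor $R_{11}$ up to signs), the reference distribution p(x) is assumed to be a binomial with matching mean, and the cumulants are generated from the standard Bernoulli recursion $\kappa_{n+1} = p(1-p)\, d\kappa_n/dp$.

```python
from math import comb, sqrt

def direct_pmf(ps):
    """Exact pmf of the number of successes, by convolving one trial at a time."""
    pmf = [1.0]
    for p in ps:
        new = [0.0] * (len(pmf) + 1)
        for k, v in enumerate(pmf):
            new[k] += v * (1 - p)
            new[k + 1] += v * p
        pmf = new
    return pmf

def bernoulli_cumulant_polys(r):
    """Coefficients (lowest power first) of kappa_1..kappa_r as polynomials in p."""
    polys = [[0.0, 1.0]]                               # kappa_1 = p
    for _ in range(r - 1):
        c = polys[-1]
        deriv = [k * c[k] for k in range(1, len(c))]   # d kappa_n / dp
        nxt = [0.0] * (len(deriv) + 2)
        for k, d in enumerate(deriv):                  # multiply by p - p^2
            nxt[k + 1] += d
            nxt[k + 2] -= d
        polys.append(nxt)
    return polys

def raw_moments(ps, r):
    """Raw moments m_0..m_r obtained from the summed per-trial cumulants."""
    kappa = [sum(sum(c * p ** k for k, c in enumerate(poly)) for p in ps)
             for poly in bernoulli_cumulant_polys(r)]
    m = [1.0]
    for n in range(1, r + 1):                          # moments from cumulants
        m.append(sum(comb(n - 1, k) * kappa[k] * m[n - 1 - k] for k in range(n)))
    return m

def expansion_pmf(ps, r):
    """Approximate pmf p(x) * sum_j a_j h_j(x) matching moments 0..r."""
    n = len(ps)
    pbar = sum(ps) / n
    w = [comb(n, x) * pbar ** x * (1 - pbar) ** (n - x) for x in range(n + 1)]
    M = raw_moments(ps, r)
    # Weighted Gram-Schmidt on columns 1, x, ..., x^r:  X = B R,  B^T D B = I.
    B, R = [], [[0.0] * (r + 1) for _ in range(r + 1)]
    for j in range(r + 1):
        v = [float(x) ** j for x in range(n + 1)]
        for i, b in enumerate(B):
            R[i][j] = sum(wx * bx * vx for wx, bx, vx in zip(w, b, v))
            v = [vx - R[i][j] * bx for vx, bx in zip(v, b)]
        R[j][j] = sqrt(sum(wx * vx * vx for wx, vx in zip(w, v)))
        B.append([vx / R[j][j] for vx in v])
    # Forward substitution for the lower-triangular system R^T a = M.
    a = []
    for j in range(r + 1):
        a.append((M[j] - sum(R[i][j] * a[i] for i in range(j))) / R[j][j])
    return [wx * sum(a[j] * B[j][x] for j in range(r + 1)) for x, wx in enumerate(w)]
```

With r equal to the number of trials the sketch reproduces the direct pmf up to rounding, which mirrors the remark that using m moments gives an exact result; for smaller r only the first r moments are matched.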

5.  Computational  Results  and  Conclusions:  All 
computations  presented  were  performed  in  IEEE  Binary 
Rounded  Double  Precision  floating  point  arithmetic  on  an 
Intel  Pentium  processor.  The  codes  were  written  in 
WATCOM FORTRAN 77 and run under the OS/2 2.11
operating  system.  The  constrained  least  squares  problems 
were  solved  using  the  codes  of  Hanson  and  Haskell,  TOMS 
Algorithm  587  [3].  In  all  cases,  the  values  of  the  cumulative 
distribution function resulting from the computed probability
mass  functions  found  by  the  methods  of  sections  3  and  4 
were  compared  to  like  values  found  by  the  direct  method  of 
section  2.  The  values  presented  in  Table  I  are  for  the  relative 
differences  between  the  computed  values.  It  should  be  noted 
that  the  values  computed  directly  are  also  subject  to  error.  In 
particular,  although  a  value  calculated  directly  is  an  unbiased 
(with  respect  to  the  distribution  of  the  rounding  errors) 
estimate  of  the  true  value,  its  variance  grows  as  N  grows 
and so any confidence interval grows as well. Thus we will
refer  to  these  as  relative  differences  in  the  computed  values 
but  not  as  relative  errors.  Rather  than  give  mean  relative 
differences  in  Table  I,  we  give  a  five  number  display  which 
includes  the  extremes,  the  quartiles  and  the  median.  In 
addition,  we  give  results  for  "nominal"  1%,  5%  and  10% 
critical  points  in  both  tails  of  the  distribution.  Since  the 
distribution  is  discrete,  these  levels  are  not  exact  and 
represent the relative difference at the point on the CDF
which  is  closest  to  the  indicated  probability  level. 
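
The summary just described can be sketched as follows. This is an illustrative reading of the text, not the authors' code: the function names are hypothetical, and the quartile convention (the "inclusive" method of Python's statistics module) is an assumption, since the paper does not say which convention was used.

```python
from statistics import quantiles

def closest_index(cdf, level):
    """Index of the CDF value nearest to a nominal probability level."""
    return min(range(len(cdf)), key=lambda k: abs(cdf[k] - level))

def relative_difference(approx_cdf, direct_cdf, level):
    """Relative difference at the point of the direct CDF closest to `level`."""
    k = closest_index(direct_cdf, level)
    return (approx_cdf[k] - direct_cdf[k]) / direct_cdf[k]

def five_number(values):
    """Minimum, lower quartile, median, upper quartile and maximum."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, med, q3, max(values)
```

For the upper tail the same functions can be used with levels 0.99, 0.95 and 0.90; one relative difference per simulated set of probabilities is collected and then passed to `five_number`.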

Results  are  presented  for  the  case  of  N=20  for  the 
GCV method and for the MSRE method. These are based on
100  randomly  generated  sets  of  probabilities  of  success.  For 
the  MSRE  method,  results  are  given  for  the  cases  of  4 
directly  calculated  values  used  at  each  end  to  estimate  the 
Mean  Square  Relative  Error.  The  MSRE  method  for  N=25 
and  4  points  at  each  end  gave  similar  results  and  is  not 
shown  due  to  space  limitations.  The  values  in  the  table 
indicate that all methods of choosing $\lambda$ yield
satisfactory  results  while  the  MSRE  method  gives  results 
which  are  slightly  better  than  those  found  by  the  GCV 
method.  The  advantage  of  the  GCV  method  is  that  it 
requires  no  direct  calculations  of  the  tails  of  the  distribution. 


The  disadvantage  is  that  it  requires  the  calculation  of  the 
Singular  Value  Decomposition  (SVD)  of  one  matrix.  In  our 
experience,  the  extreme  ill-conditioning  of  the  matrix  caused 
the  SVD  code  to  fail  when  N  approached  about  50.  It  should 
be  noted  that  this  is  the  point  at  which  the  elements,  C(N,k), 
of  the  matrix  can  no  longer  be  represented  exactly  in  the 
floating  point  system. 

For  the  orthogonal  expansion  approximation 
method,  results  are  given  for  N=20  using  8  moments  and  the 
Binomial distribution with p chosen so that its mean matches
that  of  the  target  distribution.  Again  results  are  given  for  100 
simulations.  Like  any  such  expansion,  the  values  in  the  tail 
area  are  particularly  sensitive  to  the  number  of  moments 
used.  However,  the  simulation  results  indicate  that  even 
though  some  of  the  probability  estimates  in  the  tails  can  be 
negative  (and  small)  the  values  of  the  CDF  at  the 
approximate 1%, 5% and 10% levels are not badly affected.
The  overall  results  can  be  improved  slightly  if  the  most 
extreme  few  probabilities  are  calculated  directly. 

In  conclusion  we  note  that  any  of  the  methods 
described  can  yield  values  of  the  CDF  which  are  satisfactory 
for  practical  work.  The  method  based  on  the  factorial 
moments is more computationally intensive and can be
expected  to  yield  more  accurate  results.  The  approximation 
method  yields  less  accurate  results  in  general  unless  all 
moments  are  used  in  which  case  the  results  are  comparable. 
The  approximation  method  was  tested  for  randomly  chosen 
p  on  (0,1).  Intuitively,  we  would  expect  it  to  perform  better 
in situations where the $p_j$ are different but are on a narrower
interval.

Abstract:

Kokoska  (1987)  suggested  a  set  of  maximum 
likelihood  estimators  relevant  to  the  analysis  of  the 
Inhibition/Promotion  (I/P)  mammary  cancer 
chemoprevention  experiment.  This  set  of  estimators 
has  been  extended  and  studied  in  various  detail  in  a 
number  of  related  papers,  Kokoska  (1988a,  1988b), 
Hsu  (1990),  Kokoska,  Hardin,  Hsu,  and  Grubbs 
(1993).  Often,  however,  investigators  have  some 
prior  knowledge  of  a  compound  tested  in  such 
experiments due to its chemical structure and
similarity  to  related  compounds.  In  such  situations, 
experimenters  often  wish  to  exploit  this  prior 
knowledge  in  order  to  reduce  the  costs  of 
experimentation.  Thus,  this  paper  examines 
Bayesian  estimators  for  this  purpose  and  numerical 
algorithms, based on the Gibbs sampler (Gelfand and
Smith,  1990)  and  the  rejection  method  (Smith  and 
Gelfand,  1992),  with  which  to  compute  the  posterior 
distribution.  The  methodologies  are  illustrated  with 
experimental  data  taken  from  Grubbs  (1993). 

1.  Introduction. 

The Inhibition/Promotion (I/P) Cancer
Chemoprevention Experiment is designed to investigate
the effect of compounds that can be given in the diet on
incidence rates of cancer. These experiments are often
administered by the National Cancer Institute. The
primary purpose of the experiment is to isolate and
identify potential cancer inhibiting or promoting
substances in humans. Variables of interest in these
experiments are the incidence of tumors in the animals,
the number of tumors per animal, and the rate at which
tumors develop. The Chemoprevention Branch,
Division of Cancer Prevention and Control, in the
National Cancer Institute has issued guidelines for
statistical analysis such as the log-rank test and the Armitage
test. However, difficulty in analyzing these
experiments may occur because the
experiment is terminated before all the induced tumors
have been observed (i.e., the data are right censored).


Therefore, fewer observed tumors in the
treatment group compared to control could be the result
of a decreased number of induced tumors, a decreased
tumor growth rate, or both. The problem
results from the fact that the number of induced tumors
(M) in each animal is dependent upon the time to tumor
detection (T). Current statistical methods do not
account for this confounding since they do not test the
number of induced tumors and the time to tumor
detection simultaneously.

Kokoska  (1987)  suggested  a  set  of  maximum 
likelihood  estimators  relevant  to  the  analysis  of  the 
mammary  cancer  chemoprevention  experiment.  This 
set of estimators has been extended and studied in
various detail in a number of related papers, Kokoska
(1988a, 1988b), Hsu (1990), Kokoska, Hardin, Hsu,
and Grubbs (1993). In this paper the basic idea of
Kokoska's method will be reviewed.

2.  Mathematical  Model  of  Kokoska's  Approach 

Kokoska proposed modelling the number of
induced tumors, M, as a Poisson distribution, and the
time to tumor detection, T, as a gamma distribution.
Suppose that a treatment group consists of n animals,
and $m_i$ (i = 1, 2, ..., n) is the number of promoted tumors
in animal i. Let $t_{ij}$ be the observed time to detection of
tumor j in animal i (j = 1, 2, ..., $m_i$), and let $J(t_i)$ be the
number of observed tumors for the animal i at the time
$t_i$. Further, denote the mean and variance of M by $\mu_M$ and
$\sigma^2_M$, respectively. Let F(t) be the cumulative distribution
function (cdf) of T. Kokoska (1987) has shown that
$J(t_i)$ has mean and variance

These results demonstrate mathematically the
dependence of the number of detectable tumors at time
$t_i$ on the mean number of induced tumors and the time
to tumor detection.
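
The dependence can be made concrete with a short derivation. The sketch below is an illustration under a stated assumption and is not quoted from Kokoska (1987): it assumes that, given M induced tumors, each tumor is detected by time t independently with probability F(t), so that J(t) given M is binomial with parameters M and F(t).

```latex
\begin{align*}
E[J(t)] &= E\bigl[\,E[J(t)\mid M]\,\bigr] = E[M]\,F(t) = \mu_M F(t), \\
\operatorname{Var}[J(t)] &= E\bigl[\operatorname{Var}(J(t)\mid M)\bigr]
                          + \operatorname{Var}\bigl(E[J(t)\mid M]\bigr) \\
                         &= \mu_M F(t)\{1 - F(t)\} + \sigma_M^2 F(t)^2 .
\end{align*}
```

Under the same assumption, if M is Poisson with mean $\lambda$ then $\mu_M = \sigma_M^2 = \lambda$ and both expressions reduce to $\lambda F(t)$, so a small count at time t cannot by itself distinguish a small $\lambda$ from a small F(t).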

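The log-likelihood function of J(t) can be
shown to be

\[ LL(\lambda, \alpha, \beta) = -n\lambda F(t^*; \alpha, \beta) + s_1\{\ln(\lambda) - \alpha\ln(\beta) - \ln(\Gamma(\alpha))\} + s_2(\alpha - 1) - s_3/\beta - \ln(K) \]

where $F(t^*; \cdot, \cdot)$ denotes the cumulative distribution function of
T,

\[ s_1 = \sum_{i=1}^{n} m_i, \quad s_2 = \sum_{i=1}^{n}\sum_{j=1}^{m_i} \ln(t_{ij}), \quad s_3 = \sum_{i=1}^{n}\sum_{j=1}^{m_i} t_{ij}, \quad K = \prod_{i=1}^{n} m_i!, \]

and all the animals are sacrificed at the end of the
experiment, $t^*$.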

This log-likelihood can be numerically optimized to
obtain the MLEs of interest using the IMSL
FORTRAN library subroutine DBCONF. The mean
number of induced tumors per animal can be estimated
via the MLE $\hat\lambda$. However, the parameter $\mu$, the mean
time to tumor detection, is of more biological
significance than the estimates of the parameters
associated with each of the continuous distributions.
An MLE of $\mu$ can be easily obtained using the
invariance property of MLEs,
i.e., $\hat\mu = \hat\alpha\hat\beta$ (Roussas, 1973).
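
A log-likelihood of this kind can be evaluated with the standard library alone. The sketch below is an illustration, not the authors' IMSL-based code: it assumes the Poisson-gamma model with shape $\alpha$ and scale $\beta$ (so that the mean detection time is $\alpha\beta$), a common sacrifice time `t_star`, and the form $-n\lambda F(t^*) + s_1\{\ln\lambda - \alpha\ln\beta - \ln\Gamma(\alpha)\} + s_2(\alpha-1) - s_3/\beta - \ln K$ that follows from that model. The function names and the series used for the gamma CDF are choices made for this example.

```python
import math

def gamma_cdf(t, alpha, beta):
    """Gamma(shape alpha, scale beta) CDF via the series for the
    regularized lower incomplete gamma function."""
    x = t / beta
    if x <= 0:
        return 0.0
    term = math.exp(alpha * math.log(x) - x - math.lgamma(alpha + 1))
    total = term
    k = 1
    while term > 1e-17 * total and k < 10000:
        term *= x / (alpha + k)
        total += term
        k += 1
    return total

def log_likelihood(lam, alpha, beta, times, t_star):
    """times[i] lists the detection times observed in animal i by t_star."""
    n = len(times)
    s1 = sum(len(ts) for ts in times)                      # total tumor count
    s2 = sum(math.log(t) for ts in times for t in ts)      # sum of log times
    s3 = sum(t for ts in times for t in ts)                # sum of times
    ln_k = sum(math.lgamma(len(ts) + 1) for ts in times)   # ln prod m_i!
    return (-n * lam * gamma_cdf(t_star, alpha, beta)
            + s1 * (math.log(lam) - alpha * math.log(beta) - math.lgamma(alpha))
            + s2 * (alpha - 1) - s3 / beta - ln_k)
```

Maximizing this function over ($\lambda$, $\alpha$, $\beta$) with any bounded optimizer plays the role that DBCONF plays in the text; the series for the CDF is adequate for moderate values of `t_star / beta` and would need replacing for very large arguments.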

This  parametric  model  has  been  extended 
using  various  assumptions  to  eight  models  (Kokoska, 
1988;  Hsu,  1990;  Hardin  and  Hsu,  1991;  Kokoska  et 
al.,  1993)  for  different  kinds  of  data  assumptions.  In 
this  paper,  however,  Poisson  and  gamma  distributions 
for  the  number  of  induced  tumors  in  each  animal  and 
their  times  to  tumor  detection,  respectively,  are 
examined  in  comparison  to  the  estimates  using  the 
Bayesian  approaches. 

3.  Bayesian  Methods 

Since  investigators  may  have  some  prior 
knowledge  of  a  compound  tested  in  such  experiments 
due  to  its  chemical  structure  and  similarity  to  related 
compounds,  they  may  wish  to  exploit  this  prior 
knowledge  to  reduce  the  duration  of  the  experiment  or 
to  lessen  the  number  of  experimental  animals  due  to  the 
cost of experimentation. This section examines
Bayesian estimators; algorithms based on the Gibbs sampler
(Gelfand and Smith, 1990) and the rejection method
(Smith and Gelfand, 1992) are both presented. These
techniques are applied to actual experimental data.

Gibbs  sampling  has  allowed  the  computation 
of  complicated  statistical  models  based  on  Bayesian 
posterior inference. Additionally, the rejection method
of Smith and Gelfand is a straightforward
sampling-resampling perspective that allows the computation of
Bayesian estimators using easily implemented
calculation strategies. The methodologies are
introduced as follows.

(1)  Gibbs  sampling  method 

Suppose that X, Y, and Z are random
variables, and their conditional distributions,
$f_{X|Y,Z}(x|y,z)$, $f_{Y|X,Z}(y|x,z)$, and $f_{Z|X,Y}(z|x,y)$, are known. If initial
values $x_0$, $y_0$ are specified, then a "Gibbs sequence"
of values of the random variables, $X_0'$, $Y_0'$, $Z_0'$, $X_1'$, $Y_1'$,
$Z_1'$, ..., $X_k'$, $Y_k'$, $Z_k'$, can be obtained iteratively by
alternately generating values from

$f_{X|Y,Z}(x \mid Y_j' = y_j', Z_j' = z_j')$ and

$f_{Y|X,Z}(y \mid X_j' = x_j', Z_j' = z_j')$ and

$f_{Z|X,Y}(z \mid X_j' = x_j', Y_j' = y_j')$.

It turns out that under reasonably general
conditions, the distribution of $X_k'$ converges to $f_X(x)$,
which is the true marginal of X, as $k \to \infty$ (Casella and
George, 1992). Thus for k large enough the final
observation $X_k' = x_k'$ is effectively a sample from $f_X(x)$.
The same holds for the observations from $f_Y(y)$ and $f_Z(z)$.

In this paper the conditional distributions of
the parameters of interest are assumed as follows.

$f(\lambda \mid \alpha, \beta) \sim \mathrm{Gamma}(\beta/\alpha, 1)$, and

$f(\alpha \mid \beta, \lambda) \sim \mathrm{Normal}(\beta/\lambda, \lambda)$, and

$f(\beta \mid \alpha, \lambda) \sim \mathrm{Normal}(\alpha\lambda, \lambda)$.

One thousand iterations were made to get the marginal
distributions for the parameters, and samples of size
10,000 were generated.
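
The alternating scheme is easiest to see on a toy problem. The sketch below is not the sampler used in this paper: it assumes a standard bivariate normal target with correlation `rho`, for which both full conditionals are known normals, and the function name and burn-in length are choices made for the example.

```python
import random
from math import sqrt

def gibbs_bivariate_normal(rho, n_iter, burn_in=1000, seed=0):
    """Gibbs sequence for a standard bivariate normal with correlation rho.

    Uses X | Y = y ~ N(rho * y, 1 - rho^2) and Y | X = x ~ N(rho * x, 1 - rho^2).
    """
    rng = random.Random(seed)
    sd = sqrt(1.0 - rho * rho)
    x, y = 0.0, 0.0                      # arbitrary starting values
    draws = []
    for k in range(burn_in + n_iter):
        x = rng.gauss(rho * y, sd)       # update X given the current Y
        y = rng.gauss(rho * x, sd)       # update Y given the new X
        if k >= burn_in:
            draws.append((x, y))
    return draws
```

After the burn-in the retained pairs behave approximately like draws from the joint distribution, so their first coordinates approximate the marginal of X. A three-variable sampler adds a third conditional draw to each pass in the same way.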

(2)  Rejection  Method 

For fixed x, let $f_x(\theta) = l(\theta; x)\,p(\theta)$ where
$l(\theta; x)$ is the likelihood function of $\theta$ and $p(\theta)$ is the
prior distribution of $\theta$. Let $\hat\theta$ be the M.L.E. of $\theta$, and
$M = l(\hat\theta; x)$. The first step of this method involves the
generation of $\theta$ from $p(\theta)$ and also the generation of u
from the continuous uniform distribution on (0, 1). Second,
evaluate the following procedure

$u \le f_x(\theta)/(M p(\theta)) \Rightarrow$ accept $\theta$

$u > f_x(\theta)/(M p(\theta)) \Rightarrow$ reject $\theta$

Thus a sample of the posterior distribution of
the parameter $\theta$ can be obtained if the above procedures
are applied repeatedly. In this paper uniform
distributions were used for the prior distributions of the
parameters $\alpha$, $\beta$, and $\lambda$, and 10,000 simulations were
generated.
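
The accept/reject step can be sketched on a one-parameter example. The code below is an illustration only and does not use the tumor model: it assumes Poisson counts with unknown mean $\theta$, a uniform prior on a hypothetical interval (`lo`, `hi`) that contains the MLE, and compares likelihoods on the log scale to avoid underflow.

```python
import math
import random

def poisson_loglik(theta, data):
    """Log-likelihood l(theta; x) for independent Poisson counts."""
    return sum(x * math.log(theta) - theta - math.lgamma(x + 1) for x in data)

def rejection_posterior(data, lo, hi, n_keep, seed=0):
    """Draw n_keep posterior values under a Uniform(lo, hi) prior."""
    rng = random.Random(seed)
    theta_hat = sum(data) / len(data)          # MLE of a Poisson mean
    log_m = poisson_loglik(theta_hat, data)    # log of M = l(theta_hat; x)
    kept = []
    while len(kept) < n_keep:
        theta = rng.uniform(lo, hi)            # candidate from the prior
        u = 1.0 - rng.random()                 # uniform on (0, 1]
        # accept when u <= l(theta; x) / M, i.e. f_x(theta) / (M p(theta))
        if math.log(u) <= poisson_loglik(theta, data) - log_m:
            kept.append(theta)
    return kept
```

The accepted values are a sample from the posterior; their mean and empirical 2.5% and 97.5% quantiles give a point estimate and a 95% credibility interval. The acceptance rate falls as the prior becomes diffuse relative to the likelihood, which is the main practical cost of the method.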

4.  Application 

In a study (Grubbs, 1993), sixty female
Sprague-Dawley rats were randomly divided into 2
groups. In Group 1 the rats were treated by retinoid
vehicle, and the rats in Group 2 were treated by RTBE
(934 mg/kg of diet). Then the MNU was administered
to every animal. In both experiments, the animals were
palpated for the detection of mammary tumors. The
investigation was terminated 182 days after the
injection of carcinogen. Tables 1 and 2 give the
survival times, the numbers of induced tumors, and the
times of development of mammary cancer for each
group.

5.  Discussion 

Tables 1 and 2 present the maximum
likelihood estimates for the mean number of induced
tumors per animal and for the mean time to tumor
detection and the corresponding 95% confidence
intervals and 95% credibility intervals using the classical
approach and the Bayesian techniques for each group.

Clearly, the estimates $\hat\lambda$ and $\hat\mu$ using Kokoska's
approach and the rejection method of Smith and
Gelfand are very close; and their 95% confidence
intervals and credibility intervals are similar, as well.
However, the estimates using the rejection method seem
to be slightly better than those of the classical
approach since the credibility intervals are narrower
than the corresponding confidence intervals. The
estimates of the parameters of interest using the Gibbs
sampling technique are not good compared to the
estimates using either the classical or the rejection
methods. This might be due to the selection of
inappropriate prior conditional distributions for the
parameters of interest. Work is currently under way to
incorporate researchers' experience to obtain better
prior conditional densities for the parameters.


Table 1. Estimates and 95% Confidence/Credibility Intervals of the Parameters for Group 1 (Control Group)

Parameter   Quantity                               Kokoska's method    Gibbs sampler      Rejection method

$\lambda$   Estimate                               [illegible]         8.69               15.81
            95% Confidence/Credibility Interval    (13.15, 18.46)      (5.14, 18.57)      (13.30, 17.65)

$\mu$       Estimate                               185.83              241.25             187.76
            95% Confidence/Credibility Interval    (171.92, 199.75)    (26.14, 579.43)    (173.05, 203.79)


Table 2. Estimates and 95% Confidence/Credibility Intervals of the Parameters for Group 2 (Treatment Group)

Parameter   Quantity                               Kokoska's method    Gibbs sampler      Rejection method

$\lambda$   Estimate                               9.19                8.71               9.32
            95% Confidence/Credibility Interval    (7.46, 11.07)       (5.15, 18.50)      (7.70, 10.67)

$\mu$       Estimate                               155.30              242.17             154.55
            95% Confidence/Credibility Interval    (143.03, 167.56)    (26.35, 590.10)    (138.13, 170.49)

ABSTRACT. Molecular similarity analysis involves
the analysis of data based on complex data
types used to represent the information in molecular
structures. With many complex data types,
we lose many nice features of vector spaces, but
retain the concept of proximity. Let $X$ be a random
variable with density $f$ defined on a space $\Omega$.
Let $g$ be any other density defined on $\Omega$. Define
the relative aggregation $\alpha(f|g)$ of $f$ with respect
to $g$ by

\[ \alpha(f|g) = \frac{\int f^2 g}{\left(\int f g\right)^2}. \]

Suppose $X = (X_1, X_2)$ with marginal densities
$f_1$ and $f_2$. Define $g = \tfrac12 f + \tfrac12 f_1 f_2$. Define the
dependence coefficient $\delta(X_1, X_2)$ by $\delta(X_1, X_2) =
\alpha(f|g)/\alpha(f)$ where $\alpha(f)$ is the relative aggregation
of $f$ with respect to itself. We show that
if $f$ is the bivariate normal density, then the
$\delta$-coefficient varies monotonically with the correlation
coefficient. The $\delta$-coefficient can be estimated
using random quadrat sampling when a suitable
proximity measure is defined on $\Omega$. Two representations
used in computing molecular similarity
are shown to have a high delta coefficient.

1.  Introduction 

Statisticians continue to encounter increasingly
complex data types. In our application of
molecular similarity analysis to drug discovery research,
examples include binary vectors representing
the presence or absence of up to 300 molecular
fragments, labeled graphs representing the
bonding structures of molecules, and scalar fields
in $\mathbb{R}^3$ for representing the electrostatic fields of
molecules (Johnson, 1989, Johnson and Maggiora,
1990). Problems associated with high dimensionality
abound, and in some cases, we even lose
the natural definitions of such concepts as coordinates,
location, and linear transformations. One
important concept that remains is proximity. If
two objects are represented by the same data type,
we can virtually always measure how similar one
is to the other.

Recently, Cheng and Johnson (1994a, 1994b)
proposed the concept of relative aggregation coefficients
as a method of developing statistical inference
on probability spaces in which a proximity
measure has been defined. Here we illustrate the
use of relative aggregation coefficients in developing
a general measure of dependence between two
random variables. Although our approach generalizes
directly to arbitrary probability spaces, the
discussion will be limited to Euclidean spaces. After
defining relative aggregation coefficients and
presenting a moment estimator for them, we develop
a coefficient of dependence and show its relationship
to the bivariate normal correlation coefficient.
We then compute the dependence coefficient
for two high-dimensional vector representations
used for measuring molecular similarity.


2.  Relative  Aggregation  Coefficients 

Let $f$ and $g$ be probability density functions
defined on $\mathbb{R}^k$ such that $\int f^2 g$ exists. Then the
relative aggregation coefficient (RAC) $\alpha(f|g)$ of $f$
with respect to $g$ is defined by

\[ \alpha(f|g) = \frac{\int f^2 g}{\left(\int f g\right)^2}. \]

If $g = f$, then we write $\alpha(f)$ for $\alpha(f|g)$, and we
call $\alpha(f)$ the self aggregation coefficient of $f$.

Some insight into aggregation coefficients is
gained by viewing these integrals as moments of
$f(Z)$ where $g$ is the density of $Z$. Write $E_g[f^i(Z)]$
for $\int f^i g$ and call it the $i$'th relative moment of $f$
with respect to $g$, or simply the $i$'th self moment
of $f$ if $g = f$. Then the RAC of $f$ with respect to $g$
is simply the second relative moment of $f$ with respect
to $g$ divided by the square of the corresponding
first relative moment. It follows immediately
that $\alpha(f|g) \ge 1$.

What would make this ratio large? Consider
any other density $h$ for which $\int fh < \epsilon \int f^2$. Let
$g$ be the mixture $pf + qh$ where $p + q = 1$. Then
$\int f^2 g > p\int f^3$ and $\int fg < \int f^2\,(p + q\epsilon)$. It follows
that $\alpha(f|g) > p\,\alpha(f)/(p + q\epsilon)^2$ which goes to
$p^{-1}\alpha(f)$ as $\epsilon \to 0$. Since $\alpha(f) \ge 1$, we can always
find a $g$ so as to make $\alpha(f|g)$ arbitrarily large.
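
A numerical check makes the definition and the mixture argument tangible. The sketch below is an illustration, not part of the original analysis: it assumes one-dimensional densities, approximates the integrals with a midpoint rule on a hypothetical interval [-12, 12], and uses made-up function names.

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution with mean mu and s.d. sigma."""
    z = (x - mu) / sigma
    return exp(-0.5 * z * z) / (sigma * sqrt(2.0 * pi))

def rac(f, g, lo=-12.0, hi=12.0, steps=4800):
    """Relative aggregation coefficient (int f^2 g) / (int f g)^2."""
    h = (hi - lo) / steps
    xs = [lo + (k + 0.5) * h for k in range(steps)]   # midpoint rule nodes
    second = sum(f(x) ** 2 * g(x) for x in xs) * h    # second relative moment
    first = sum(f(x) * g(x) for x in xs) * h          # first relative moment
    return second / first ** 2

def mixture(p, f, h):
    """The density p f + (1 - p) h."""
    return lambda x: p * f(x) + (1.0 - p) * h(x)
```

For a normal density the self aggregation coefficient works out to $2/\sqrt{3}$ whatever the variance. Mixing in a well-separated component, for example `mixture(0.1, normal_pdf, lambda x: normal_pdf(x, mu=8.0))`, drives the coefficient toward $\alpha(f)/p$, in line with the bound derived above.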


3. A Coefficient of Dependence

Let $X$ and $Y$ be two random variables defined
on $\mathbb{R}^{k_1}$ and $\mathbb{R}^{k_2}$ where $k_1 + k_2 = k$. Let $f$ denote
the joint density of $(X, Y)$, and let $f_1$ and $f_2$ denote
the respective marginals of $f$. Let $h$ be the
product density $f_1 f_2$. Then $X$ and $Y$ are independent
if and only if $f = h$. Define $g = \tfrac12 f + \tfrac12 h$,
and define the dependence coefficient $\delta(X, Y)$ by

\[ \delta(X, Y) = \frac{\alpha(f|g)}{\alpha(f)}. \]

Clearly if $f = h$, then $\delta(X, Y) = 1$. On the
other hand, we see from the preceding section that
$\delta(X, Y) \approx 2$ whenever $\int fh \approx 0$.

Figure 1 plots the dependence coefficient in
the case $f$ is the bivariate normal for various values
of the correlation coefficient. A distinct monotonic
relationship is obtained. The correlation coefficient
in the figure could be replaced by its absolute
value as aggregation coefficients are invariant
under a particular subclass of linear transformations
on $\mathbb{R}^k$, as is now demonstrated.

Define $T$ by

\[ T = \begin{pmatrix} T_x & 0 \\ 0 & T_y \end{pmatrix} \]

where $T_x$ and $T_y$ are square nonsingular matrices
with $k_1$ and $k_2$ rows respectively. Let $x_0$ and $y_0$
be fixed vectors of length $k_1$ and $k_2$. Then the
density $\tilde f$ of $T(X - x_0, Y - y_0)$ is given by

\[ \tilde f = |T^{-1}|\, f(T^{-1}(X - x_0, Y - y_0))
           = |T_x^{-1}|\,|T_y^{-1}|\, f(T_x^{-1}(X - x_0), T_y^{-1}(Y - y_0)). \]

It then follows that the marginal densities of $\tilde f$ are
given by $|T_x^{-1}|\, f_1(T_x^{-1}(X - x_0))$ and
$|T_y^{-1}|\, f_2(T_y^{-1}(Y - y_0))$. Straightforward calculations
give $\delta(X, Y) = \delta(T_x(X), T_y(Y))$.

Figure 1. Dependence coefficient versus the
log of one minus the correlation coefficient.
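
The bivariate normal case can be reproduced numerically. The sketch below is an illustration rather than the computation behind Figure 1: it assumes standard normal marginals, takes $g = \tfrac12 f + \tfrac12 h$ with $h$ the product of the marginals, and approximates each integral by a Riemann sum on a hypothetical grid over $[-6, 6]^2$.

```python
from math import exp, pi, sqrt

def delta_bivariate_normal(rho, half_width=6.0, step=0.1):
    """Dependence coefficient alpha(f|g) / alpha(f) for correlation rho, |rho| < 1."""
    n = int(round(2 * half_width / step)) + 1
    grid = [-half_width + k * step for k in range(n)]
    det = 1.0 - rho * rho

    def f(x, y):                       # joint density with correlation rho
        q = (x * x - 2.0 * rho * x * y + y * y) / (2.0 * det)
        return exp(-q) / (2.0 * pi * sqrt(det))

    def h(x, y):                       # product of the two N(0, 1) marginals
        return exp(-(x * x + y * y) / 2.0) / (2.0 * pi)

    f2g = fg = f3 = f2 = 0.0
    for x in grid:
        for y in grid:
            fv = f(x, y)
            gv = 0.5 * fv + 0.5 * h(x, y)
            f2g += fv * fv * gv
            fg += fv * gv
            f3 += fv ** 3
            f2 += fv * fv
    w = step * step                    # area element of the grid
    alpha_fg = (f2g * w) / (fg * w) ** 2
    alpha_f = (f3 * w) / (f2 * w) ** 2
    return alpha_fg / alpha_f
```

Evaluating the function at a few correlations shows the behaviour described in the text: the value is 1 at zero correlation, grows with the absolute correlation, and is the same for `rho` and `-rho`.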


4.  A  Consistent  Estimator 

Let $X$, $Y$, $f$, $f_1$, $f_2$, and $h$ be as defined. We
seek a consistent estimator of $\delta(X, Y)$ which is
a ratio of two RACs. Since the denominator is
bounded away from zero, it follows that the ratio
of consistent estimators of the numerator and
denominator of $\delta(X, Y)$ is a consistent estimator
of $\delta(X, Y)$. A consistent moment estimator of a
RAC is presented in Cheng and Johnson (1994c)
elsewhere in this volume. Briefly, it is constructed
as follows: Let $S = \{(x_1, y_1), \ldots, (x_N, y_N)\}$ be a
dataset of $N$ independent samples from $f$, and let
$z_1, \ldots, z_m$ be $m$ independent samples from density
$g$ where $g$ is any other density defined on $\mathbb{R}^k$. Let
$d$ be any proximity measure defined on $\mathbb{R}^k$. Define

\[ B_r(z) = \{(x, y) \mid d((x, y), z) < r\}, \]

and define $n_r(z_i)$, $i = 1, \ldots, m$, to be the cardinality
of the set

\[ \{(x, y) \mid (x, y) \in B_r(z_i),\ (x, y) \in S,\ \text{and } (x, y) \ldots\}. \]

Let $\bar{X}_r$ and $s_r^2$ be the sample mean and variance of
$n_r(z_i)$, $i = 1, \ldots, m$, and define $A_r = \ldots$ Then
Cheng and Johnson show that


\[ \hat{\alpha}(f|g) = \frac{N}{N-1} \times \cdots \]

is a consistent estimator of $\alpha(f|g)$ under the
assumption that


f{i)di 


Jb,(z) 


dt  -|-  o 


(The optimal estimation of $\alpha(f|g)$ is the subject
of another study.)

In spatial statistics, the neighborhood $B_r(z_i)$
is called a quadrat centered at $z_i$ and gives rise
to the term "random quadrat sampling", when
$z_i$ represents the outcome of a random variable.
Random quadrat sampling requires the definition
of a proximity measure. Interestingly, proximity
measures do not figure into the definition of RACs,
but often enter into the picture when RACs are being
estimated. With a consistent estimator now in
hand, all that remains is to clarify how one defines
quadrat sampling with respect to densities $f$ and
$h$.

As in the example that follows, usually one
has proximity measures $d_1$ and $d_2$ associated with
$X$ and $Y$ and must construct a proximity measure
$d$ for $(X, Y)$. There are many ways. The following
definition is convenient from a computational
standpoint:

\[ d((x_1, y_1), (x_2, y_2)) = \max[d_1(x_1, x_2), d_2(y_1, y_2)]. \qquad (1) \]

To illustrate this convenience, write $z = (u, v)$
where $u$ and $v$ are $k_1$- and $k_2$-dimensional vectors.
Define $B_{r,1}(u) = \{x \mid d_1(x, u) < r\}$ and
$B_{r,2}(v) = \{y \mid d_2(y, v) < r\}$. Now consider our
problem in estimating the numerator of $\delta(X, Y)$.
We must count the number of points in $S$ which
fall in the quadrat when half of the time the center
of the quadrat is drawn according to the joint
density $f$ and the other half of the time the center
is drawn from the product density $h$. In either
case, the cardinality $n_r(z_i)$, $i = 1, \ldots, m$, when $d$ is
defined by equation 1, is simply the cardinality of
the intersection of the following two sets:

\[ \{(x, y) \mid x \in B_{r,1}(u_i),\ (x, y) \in S,\ \text{and } x \ldots\} \]

and

\[ \{(x, y) \mid y \in B_{r,2}(v_i),\ (x, y) \in S,\ \text{and } y \ldots\}. \]

We assure that $z_i$, $z_i = (u_i, v_i)$, is drawn according
to $f$, by drawing at random from $S$. We
assure that $z_i$ is drawn according to $h$ by drawing
$u_i$ at random from the set $\{x \mid (x, y) \in S\}$ and then
drawing $v_i$ at random from the set $\{y \mid (x, y) \in S\}$.
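
The counting step can be sketched as follows. This is an illustration with hypothetical names, not the authors' implementation: it uses the max-type proximity measure of equation 1, alternates centers between the joint draw and the product draw, and omits any additional exclusion condition on the counted points.

```python
import random

def quadrat_count(S, center, r, d1, d2):
    """Number of points of S within r of `center` under max(d1, d2).

    Computed as the size of the intersection of the points close in x
    and the points close in y, which is what equation 1 allows.
    """
    u, v = center
    close_in_x = {i for i, (x, _) in enumerate(S) if d1(x, u) < r}
    close_in_y = {i for i, (_, y) in enumerate(S) if d2(y, v) < r}
    return len(close_in_x & close_in_y)

def draw_center(S, rng, from_joint):
    """Center drawn from f (a random pair) or from h (independent components)."""
    if from_joint:
        return rng.choice(S)
    return (rng.choice(S)[0], rng.choice(S)[1])

def quadrat_counts(S, m, r, d1, d2, seed=0):
    """Counts for m centers, half from the joint and half from the product."""
    rng = random.Random(seed)
    centers = [draw_center(S, rng, i % 2 == 0) for i in range(m)]
    return [quadrat_count(S, z, r, d1, d2) for z in centers]
```

In a similarity-searching setting the two index sets correspond to the hit lists of two separate searches with the same radius, so each count reduces to a set intersection of search results.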

5.  An  Example 

There is an increasing use of molecular similarity
measures in the pharmaceutical industry
(Johnson and Maggiora, 1990). Similarity searching
is a frequent application in which one searches
large databases of molecular structures for structures
similar to some query structure of pharmaceutical
interest. The desire is to find related compounds
in the database which might also be expected
to be of related interest. See Willett (1987)
for detailed coverage of the issues and many of the
proximity measures being used in this regard. One
expects most of these proximity measures to be
highly related. In this example, we study the relationship
between two proximity measures, topological
index (TI) distance and fragment representation
(FR) similarity, used at our company for
fast similarity searching.

In mathematical chemistry, a topological index is simply a number
calculated on the bonding structure of a molecule. A simple count
of the number of atoms serves as an example of a topological
index, although most topological indices are considerably more
sophisticated. The representation for our TI distance is the
first 10 principal components of 90+ topological indices (Basak
et al., 1988). The TI distance is simply the Euclidean distance
in $\mathbb{R}^{10}$. Our fragment representation of a molecular
structure is a binary vector $x$ in which each bit represents the
presence or absence of at least one structural fragment
(connected substructure) in a fragment group. Over 300 groups of
fragments are used. The similarity measure is the Jaccard
coefficient (usually called the Tanimoto coefficient in
chemistry) defined by $x'y / (x'x + y'y - x'y)$.
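
As a minimal sketch of the two proximity measures just described, the following Python functions compute a Euclidean distance between two real vectors and the Jaccard (Tanimoto) coefficient $x'y/(x'x + y'y - x'y)$ between two binary vectors. The function names and the convention for two all-zero vectors are assumptions made for illustration; they do not come from the paper.

```python
import math

def euclidean_distance(u, v):
    """Euclidean distance between two equal-length real vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def tanimoto(x, y):
    """Jaccard/Tanimoto coefficient x'y / (x'x + y'y - x'y) for 0/1 vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    xy = sum(a * b for a, b in zip(x, y))
    xx = sum(a * a for a in x)
    yy = sum(b * b for b in y)
    denom = xx + yy - xy
    if denom == 0:
        # Both vectors are all zeros; treating them as identical is an
        # assumption of this sketch, not something stated in the text.
        return 1.0
    return xy / denom
```

For binary vectors the coefficient is the number of shared "on" bits divided by the number of bits that are on in either vector, so it lies between 0 and 1.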

These two proximity measures are highly related, although it may
not be immediately apparent from the disparity in the forms of
the information captured by their underlying vector
representations. This relatedness becomes immediately apparent
when one performs similarity searches using a common query
structure. If the common query structure is a prostaglandin (a
particular class of molecular structures), all of the most
similar compounds in the database by either proximity measure
will be prostaglandins; if the common query structure is a
benzodiazepine, all of the most similar compounds will be
benzodiazepines, etc. However, such notions of relatedness
between the two proximity measures presuppose an ability to
define classes of compounds. Moreover, any basis for quantifying
relatedness using these classes would reflect the idiosyncrasies
of the classification criteria.

Before illustrating the $\delta$-dependence measure, it is
informative to look at some counts employed in its computation.
We selected 600 query structures at random from our database of
over 100,000 structures. Both TI and FR similarity searches were
performed for each query structure. For each query structure, we
recorded the number of structures, excluding the query structure,
in each similarity neighborhood (quadrat) as well as in the
intersection of the neighborhoods. In this way, 600 3-tuples of
counts were generated. All 600 similarity searches used a fixed
cutoff value for the TI distance and another fixed cutoff value
for the FR similarity. The experiment was then repeated for the
same 600 query structures, but with different cutoff values for
the two proximity measures. In the following discussion, only the
results for a cutoff value of 0.25 for the TI distance and a
cutoff value of 0.97 for the FR similarity are presented.
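
The counting step described above can be sketched as follows. For one query structure, the code counts the structures inside the TI-distance neighborhood, inside the FR-similarity neighborhood, and inside their intersection, excluding the query itself. The function name, the list-of-vectors data layout, and the use of inclusive cutoffs (`<=` and `>=`) are assumptions of this sketch.

```python
import math

def _euclid(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def _tanimoto(x, y):
    xy = sum(a * b for a, b in zip(x, y))
    denom = sum(x) + sum(y) - xy      # valid for 0/1 vectors
    return 1.0 if denom == 0 else xy / denom

def neighborhood_counts(q, ti, fr, ti_cutoff, fr_cutoff):
    """Return (n_TI, n_FR, n_both) for query index q, excluding q itself.

    ti[j] is the TI vector and fr[j] the binary fragment vector of
    structure j (hypothetical layout chosen for this sketch).
    """
    others = [j for j in range(len(ti)) if j != q]
    ti_nbrs = {j for j in others if _euclid(ti[q], ti[j]) <= ti_cutoff}
    fr_nbrs = {j for j in others if _tanimoto(fr[q], fr[j]) >= fr_cutoff}
    return len(ti_nbrs), len(fr_nbrs), len(ti_nbrs & fr_nbrs)
```

Calling this function for each of the sampled query structures would give one 3-tuple of counts per query.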

Our first surprise was the complete lack of correlation seen in
Figure 2 between the pairs of counts based on the TI and FR
proximity measures. Since the 600 neighborhoods for each
proximity measure share a common cutoff value or radius, one
expects a high count to reflect a region (defined by the position
in space of the query structure) with a relatively high value for
the density function. Let $f_{TI}$ and $f_{FR}$ denote the
density functions associated with how the structures are
positioned in space under the TI and FR representations. Figure 2
suggests two things. First, for both densities, by far the
largest proportion of density is associated with a very low
density value, but occasionally one encounters an extremely dense
region. Second, let $TI(z)$ and $FR(z)$ denote the TI and FR
representations of structure $z$. Then the random variable
$f_{TI}(TI(Z))$ has virtually no correlation with
$f_{FR}(FR(Z))$, where $Z$ denotes a randomly selected structure.

At first, this second finding totally surprised us. However, the
apparent lack of correlation between the random variables
$f_{TI}(TI(Z))$ and $f_{FR}(FR(Z))$ does not imply a lack of
correlation between $TI(Z)$ and $FR(Z)$. To see this, imagine a
transformation that differentially stretches a space on which a
density function is defined without seriously altering
neighboring relationships. Such a transformation would preserve
the contiguous positioning of structures within a structural
class by both proximity measures while at the same time allowing
the two proximity measures to differ in how they “stretched out”
the regions defining each structural class.

Although Figure 2 does not suggest any correlation between the
low- and high-density regions under TI distance and the low- and
high-density regions under FR similarity, it is not difficult to
establish that their neighboring relationships are related using
the counts of the number of structures in intersections of
particular pairs of neighborhoods. For example, the point in
Figure 2 with coordinates (24, 9) corresponds to a pair of
neighborhoods whose intersection contains 9 structures, i.e., the
FR-similarity neighborhood is a subset of the TI-distance
neighborhood. Suppose that the structures are distributed in “FR
space” independently of their distribution in “TI space”. If we
had 100,000 structures in the database, the probability that a
randomly selected structure would fall in this TI neighborhood is
25/100,000 and the corresponding probability for the FR
neighborhood is 9/100,000. If these two events are independent,
the probability of a randomly selected structure falling in the
intersection is the product of these two probabilities. It
follows that the expected number of counts associated with two
randomly selected neighborhoods of this size is roughly estimated
by 25 × 9/100,000 = 0.00225. One can view the intersection count
as a Poisson random variable with mean 0.00225. Thus, our seeing
an intersection count of 9 is extremely improbable under the
assumption that the random variables $f_{TI}(TI(Z))$ and
$f_{FR}(FR(Z))$ are independent.
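
To give a feel for how improbable a count of 9 is, the short sketch below evaluates the upper tail of a Poisson distribution with the mean 25 × 9/100,000 quoted in the text. The helper name and the number of summed terms are choices made for this illustration only.

```python
import math

def poisson_upper_tail(mean, k_min, terms=50):
    """P(X >= k_min) for X ~ Poisson(mean), by summing the pmf directly.

    Summing the tail terms avoids the cancellation that would occur
    when computing 1 - P(X < k_min) for a very small tail.
    """
    total = 0.0
    for k in range(k_min, k_min + terms):
        log_pmf = -mean + k * math.log(mean) - math.lgamma(k + 1)
        total += math.exp(log_pmf)
    return total

n_structures = 100_000
expected = 25 * 9 / n_structures        # expected intersection count
tail = poisson_upper_tail(expected, 9)  # chance of seeing 9 or more
```

Under the independence assumption the tail probability is dominated by its first term, $e^{-\lambda}\lambda^9/9!$ with $\lambda = 0.00225$, which is vanishingly small.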

Although interesting, this particular test does not provide a
calibrated measure of dependence. For that we turn to the
$\delta$ coefficient calibrated in Figure 1. With $f_{TI}$ and
$f_{FR}$ playing the roles of $f_1$ and $f_2$ in the preceding
section, we obtain the estimates and confidence intervals for the
RACs given in Table 1. A sense for the histogram of the counts
making up the two self-aggregation coefficients can be obtained
from Figure 2. The distribution of the intersection counts for
the joint density $f_{FR \times TI}$ is given by


count     0    1    2    3    4    5    6    9
freq    525   57    7    5    2    2    1    1


All 600 counts in which the product density was the design
density were zeros. These were pooled with the preceding 600
intersection counts when estimating
$\alpha(f_{FR \times TI} \mid g)$. The bootstrap confidence
intervals were developed from 500 bootstrap samples from the
sample quantile function of the observed counts.


Table 1

RAC                                  Estimate   95% CI
$\alpha(f_{FR})$                     14.71      (12.3, 17.2)
$\alpha(f_{TI})$                     3.95       (3.65, 4.25)
$\alpha(f_{FR \times TI})$           8.12       (5.7, 10.5)
$\alpha(f_{FR \times TI} \mid g)$    16.2       (11.3, 21.2)
$\delta(f_{FR \times TI})$           2.04       (1.31, 2.94)

It is easily shown that if $f_{FR \times TI} = f_{FR} \times
f_{TI}$, then $\alpha(f_{FR \times TI}) = \alpha(f_{FR}) \times
\alpha(f_{TI})$. Clearly, this is not the case, although we are
still unsure of the meaning and significance of the fact that
$\alpha(f_{FR \times TI})$ is so much less than the product of
the self-aggregation coefficients. However,
$\alpha(f_{FR \times TI} \mid g)$ is twice
$\alpha(f_{FR \times TI})$, giving an estimate of two for
$\delta(FR(Z), TI(Z))$. Based on the calibration of Figure 1,
there is an extreme dependence between $FR(Z)$ and $TI(Z)$.

Abstract

Robust empirical and hierarchical Bayes estimators for
exchangeable normal means with heterogeneous variances are
developed. The robust empirical Bayes estimator is obtained by
using robust (or hierarchical) priors and the Newton-Raphson
algorithm. The robust hierarchical Bayes estimator is developed
through t (particularly the Cauchy) priors with the computation
being performed through the Gibbs sampler. It is shown that such
robust estimators preserve the gain of shrinkage in the presence
of extreme individual component estimators. Efron and Morris’s
classic example of estimating the toxoplasmosis prevalence rates
is reconsidered. The method is then applied to the estimation of
rates of change in longitudinal studies and is illustrated with
an example. Further, the estimators are compared with those
obtained via BLUP estimators of the random effects in SAS PROC
MIXED through a simple random coefficient growth curve model.


1  INTRODUCTION 

With the recent development of computational tools such as the
Gibbs sampler, complex data can be analyzed through a
comprehensive Bayesian hierarchical model. In this paper,
however, we consider some important estimation properties in the
basic model of estimating exchangeable normal means (or random
effects), such as the estimator’s robustness with respect to
prior misspecifications and outlying observations. Such
estimation is often needed in practice, as is demonstrated in
Morris (1983), Breslow (1990) and Louis (1991), and can be
summarized as estimating $\beta_1, \ldots, \beta_k$
simultaneously starting with their independent unbiased
estimators $b_1, \ldots, b_k$. Often it is assumed that
$b_i \mid \beta_i \sim N(\beta_i, d_i^2)$, $i = 1, \ldots, k$,
independently. It is now well known that shrinkage estimators
(Morris, 1983) can generally improve upon the usual maximum
likelihood estimator ($b_i$) for $\beta_i$ in terms of achieving
smaller squared error risk. The shrinkage estimators are often
derived from a Bayes approach by assuming that the $\beta_i$’s
are from a certain probability distribution. This method has been
shown to be useful in problems where the scientific objective was
not directly one of simultaneous estimation, e.g., it provides a
way to correct for the effect of regression to the mean and gives
estimators of regression coefficients which yield uniformly
smaller prediction mean square error in linear and logistic
regression (Copas, 1983); it also gives estimators with uniformly
smaller variances in discrete event simulation with control
variates (Tan and Gleser, 1992) and estimators of the common odds
ratio using concordant pairs (Liang and Zeger, 1988).

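In the simplest case when $d_i^2 = \sigma^2$ for all
$i = 1, \ldots, k$, $b_i \mid \beta_i \sim N(\beta_i, \sigma^2)$
with a conjugate (Gaussian) prior $\beta_i \sim N(\beta, A)$, the
shrinkage estimator of $\beta_i$ for $i = 1, \ldots, k$ as
proposed in Morris (1983) is of the form

$$\hat\beta_i = b_i - \min\left\{ \frac{k-3}{k-1},\; \frac{(k-3)\sigma^2}{\sum_{j=1}^{k} (b_j - \bar b)^2} \right\} (b_i - \bar b), \qquad (1.1)$$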


where $\bar b$ is the grand mean of the $b_i$’s. This estimator
has smaller squared error risk than the maximum likelihood
estimator, provided that $k > 4$. When the variances $d_i^2$ are
not equal, an iterative algorithm is needed to calculate the
empirical Bayes estimator. Tan and Gleser (1992) have studied the
magnitude of potential improvement of these estimators. The gain
would be substantial if the individual means are reasonably
similar.
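
The equal-variance shrinkage rule can be sketched in a few lines of Python. The sketch assumes the estimator has the form $\hat\beta_i = b_i - \min\{(k-3)/(k-1),\ (k-3)\sigma^2/\sum_j (b_j - \bar b)^2\}\,(b_i - \bar b)$, which is a reading of equation (1.1) reconstructed from a damaged source; the function name and the handling of identical estimates are also assumptions.

```python
def shrinkage_estimates(b, sigma2):
    """Shrink each b_i toward the grand mean (equal-variance case).

    Assumed form: b_i - min((k-3)/(k-1), (k-3)*sigma2/SS) * (b_i - b_bar),
    where SS is the sum of squared deviations from the grand mean.
    """
    k = len(b)
    if k <= 3:
        raise ValueError("the shrinkage factor requires k > 3")
    b_bar = sum(b) / k
    ss = sum((x - b_bar) ** 2 for x in b)
    if ss == 0.0:
        return list(b)  # all estimates already equal the grand mean
    factor = min((k - 3) / (k - 1), (k - 3) * sigma2 / ss)
    return [x - factor * (x - b_bar) for x in b]
```

When the $b_i$ are widely dispersed relative to $\sigma^2$ the factor is small and the estimates stay close to the $b_i$; when they are tightly clustered the factor approaches $(k-3)/(k-1)$ and the estimates are pulled strongly toward the grand mean.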

However, the conjugate priors are not necessarily robust (with
respect to possible misspecifications of priors). In fact, as
pointed out in Berger (1985, Chapter 4), when the likelihood
function is concentrated in the tail of the prior distribution,
conjugate priors should probably be avoided. Although the usual
normal conjugate prior for estimating the normal means is robust
within the class of all prior distributions with finite first two
moments (Morris, 1983), the moments depend on the tail of the
distribution and are thus highly variable. For instance, two
priors may be virtually indistinguishable but may have quite
different moments (Berger, 1985), and some highly robust priors
(such as the Cauchy priors) do not have moments. Sometimes the
estimator using the conjugate Gaussian prior (such as 1.1) has
been referred to as being robust in a conservative sense in that
if the prior is fully wrong or if one $b_i$ is outlying (thus in
violation of exchangeability), the empirical Bayes (EB) estimates
would collapse back to the usual maximum likelihood estimators,
resulting in no harm but nullifying the potential gain. Therefore
estimators that preserve the gain of shrinkage in the presence of
outlying components are very appealing because the rest of the
components (the individual $\beta_i$’s) can still benefit from
borrowing strength from the ensemble. This refined robustness can
be achieved by using flat-tailed priors or by hierarchical
modeling (Berger, 1985; Angers and Berger, 1991). When the
variances are not equal, the first stage parameter estimators in
the hierarchical model can be derived from Angers (1992) when the
degrees of freedom of the t-prior is odd. In general, when the
variances are heterogeneous, analytic solutions with t-priors
seem extremely difficult to obtain. With the computation being
performed via the Gibbs sampler, such robust estimates of random
effects can be easily extended to the general mixed-effects model
of Laird and Ware (1982).


The purpose of this paper is to develop empirical and
hierarchical Bayes estimators with the refined robustness. The
class of t-priors (the Cauchy prior in particular) is used to
obtain the robust hierarchical estimators with computation being
performed using the Gibbs sampler (Geman and Geman, 1984; Gelfand
et al., 1990).


As a quicker alternative, we first use the robust prior in Berger
(1985) to derive robust empirical Bayes estimators through use of
the Newton-Raphson algorithm in §2.1. Hierarchical Bayes modeling
via the Gibbs sampler is considered in §2.2. A data set from the
literature (Efron and Morris, 1975) is reconsidered in §2.3. In
§3, the method is applied to longitudinal studies where
estimation of the rates of individual change is of interest and
is illustrated with a real-life example. The paper is concluded
with a discussion in §4.


2  ROBUST  ESTIMATES 


2.1  Robust  empirical  Bayes  estimates 

The robust prior developed in Berger (1985) is based on the
consideration of the admissibility of the Bayes estimators
(Strawderman and Cohen, 1971). Using this prior, the model can be
given as

$$b_i \sim N(\beta_i, d_i^2), \quad \text{and} \quad \beta_i \sim N(\mu, B(\lambda_i)), \qquad (2.1)$$

where $B(\lambda_i) = (d_i^2 + A)/(2\lambda_i) - d_i^2$, and
$\lambda_i$ has density
$\pi(\lambda_i) = 0.5\,\lambda_i^{-1/2} I_{(0,1)}(\lambda_i)$.
Given $\mu$ and $A$, the posterior mean and variance of
$\beta_i$ are:


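$$E(\beta_i \mid b_i) = b_i - \frac{2 d_i^2}{d_i^2 + A} \left[ \frac{1}{\|b_i\|^2} - \frac{1}{e^{\|b_i\|^2} - 1} \right] (b_i - \mu),$$

$$V(\beta_i \mid b_i) = d_i^2 - \frac{2 d_i^4}{d_i^2 + A} \left[ \frac{2 \|b_i\|^2 e^{\|b_i\|^2}}{(e^{\|b_i\|^2} - 1)^2} - \frac{1}{\|b_i\|^2} - \frac{1}{e^{\|b_i\|^2} - 1} \right], \qquad (2.2)$$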

where $\|b_i\|^2 = (b_i - \mu)^2/(d_i^2 + A)$. The marginal
distribution of $b_i$ is

$$m(b_i \mid \mu, A) = \frac{1 - e^{-\|b_i\|^2}}{2 \sqrt{\pi (d_i^2 + A)}\; \|b_i\|^2}.$$

Parameters $\mu$ and $A$ can be estimated using the maximum
likelihood method via the Newton-Raphson algorithm.
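
The following sketch evaluates the posterior mean and the log marginal density implied by the model above, namely $b_i \mid \beta_i \sim N(\beta_i, d_i^2)$, $\beta_i \mid \lambda_i \sim N(\mu, (d_i^2 + A)/(2\lambda_i) - d_i^2)$ and $\pi(\lambda_i) = 0.5\lambda_i^{-1/2}$ on $(0, 1)$. Integrating out $\lambda_i$ gives $E(\lambda_i \mid b_i) = 1/z - 1/(e^z - 1)$ with $z = (b_i - \mu)^2/(d_i^2 + A)$, and a marginal density proportional to $(1 - e^{-z})/z$. These closed forms were derived here from the stated model rather than copied from the source, and the function names are hypothetical. The Newton-Raphson maximization itself is not implemented.

```python
import math

def expected_lambda(z):
    """E[lambda | b] = 1/z - 1/(e^z - 1); tends to 1/2 as z -> 0."""
    if z < 1e-6:
        return 0.5 - z / 12.0          # series expansion near zero
    if z > 700.0:
        return 1.0 / z                 # exp(z) would overflow; term is negligible
    return 1.0 / z - 1.0 / math.expm1(z)

def robust_posterior_mean(b, d2, mu, A):
    """Posterior mean of beta_i under the robust prior, given mu and A."""
    v = d2 + A
    z = (b - mu) ** 2 / v
    return b - (2.0 * d2 / v) * expected_lambda(z) * (b - mu)

def log_marginal(b, d2, mu, A):
    """log m(b | mu, A) = log[(1 - e^-z) / (2 sqrt(pi v) z)], v = d2 + A."""
    v = d2 + A
    z = (b - mu) ** 2 / v
    if z < 1e-6:
        log_ratio = -z / 2.0           # log((1 - e^-z)/z) is about -z/2
    else:
        log_ratio = math.log(-math.expm1(-z)) - math.log(z)
    return -math.log(2.0) - 0.5 * math.log(math.pi * v) + log_ratio
```

Summing `log_marginal` over the observations gives a log-likelihood in $(\mu, A)$ that any numerical optimizer could maximize. The two limits of `expected_lambda` show the behaviour discussed in the text: near the prior mean the shrinkage weight is about $d_i^2/(d_i^2 + A)$, while for an outlying $b_i$ the weight vanishes and the estimate stays at $b_i$.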

Another advantage of the above estimator is that it easily yields
subjective hierarchical Bayes estimates (Berger and Robert, 1990)
for any plausible $\mu$ and $A$. However, this prior should be
used with caution. It may cause the estimator to collapse back to
$b_i$ when $A/d_i^2$ is too big. In other words, the prior may be
so flat that its effect on the estimators is essentially the same
as that of a uniform (noninformative) prior.


2.2  Robust Hierarchical Bayes Estimate

The robust hierarchical Bayes estimate has many advantages over
the empirical Bayes estimate (Berger and Robert, 1990). A main
advantage is that it takes into account the error due to the
estimation of the hyperparameters, whereas the empirical Bayes
method ignores such error. Another advantage is that in the
hierarchical model the marginal posterior distributions can be
estimated via Gibbs sampling (Gelfand et al., 1990). Thus,
standard errors and confidence intervals can be developed easily.

As shown in Berger (1985, pages 195-196), a Cauchy prior is more
reasonable in terms of the posterior robustness and Bayesian risk
if we are uncertain as to which priors best describe our prior
belief. We now consider the following hierarchical model

$$b_i \mid \beta_i \sim N(\beta_i, d_i^2), \quad \text{and} \quad \beta \sim t_k(\mu, \sigma^2, \nu_0),$$

$$\mu \sim N(\eta, C), \quad \sigma^2 \sim \mathrm{Gamma}(p, q), \qquad (2.3)$$

where the (multivariate) $t$-distribution, with location
parameter $\mu$, scale matrix $\sigma^2 I$ and dimension $k$,
denoted by $u \sim t_k(\mu, \sigma^2 I, \nu_0)$, has density of
the form:

$$f_k(u \mid \mu, \sigma^2, \nu_0) = \frac{g(\nu_0)}{(\sigma^2)^{k/2} \left[ 1 + (u - \mu)'(u - \mu)/(\nu_0 \sigma^2) \right]^{(\nu_0 + k)/2}},$$

where $\nu_0, \sigma^2 > 0$ and $g(\nu_0) = \text{const}$. Of
particular interest are the two special cases: 1) if
$\nu_0 = 1$, $k = 1$, then $f_1(u \mid \mu, \sigma^2)$ is the
Cauchy prior with median $\mu$ and quartiles $\mu \pm \sigma$;
and 2) if $\nu_0 = \infty$,
$f_k(u \mid \mu, \sigma^2) = N_k(\mu, \sigma^2 I_p)$ is the
Gaussian prior. Since the $t$-distribution is a mixture of the
Gaussian and inverse gamma distributions, all conditional
distributions used in the Gibbs sampling have closed forms and
thus the algorithm is very efficient. In fact, the
$t$-distribution can be decomposed into

v.\t^  ~  Np(n,T^I),  and  ~ 


where 

$$IG(\nu_0/2, \nu_0/2) = (\nu_0/2)^{\nu_0/2}\, e^{-\nu_0/(2v)}\, v^{-(\nu_0/2 + 1)}\, \Gamma^{-1}(\nu_0/2)$$


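is the density function of the inverse Gamma distribution. So all
the conditional distributions are given as follows:

$$[\beta_i \mid \tau^2, \mu, (b_i)] \sim N\left( \frac{\tau^2}{d_i^2 + \tau^2}\, b_i + \frac{d_i^2}{d_i^2 + \tau^2}\, \mu,\; \frac{d_i^2 \tau^2}{d_i^2 + \tau^2} \right),$$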

Then the Gibbs sampling can be applied to the hierarchical model
specified in (2.1). Given the data ($b_i$), one can obtain the
needed marginal distribution (say $\pi(\beta_i \mid b_i)$) from
the Gibbs sampling.
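
A Gibbs sampler of the kind described above can be sketched as follows. Several details are assumptions of this sketch rather than statements from the paper: the $t$ prior is written as the scale mixture $\beta_i \mid \tau^2 \sim N(\mu, \tau^2)$ with $\tau^2 \sim IG(\nu_0/2, \nu_0\sigma^2/2)$, the hyperpriors are taken to be $\mu \sim N(\eta, C)$ and $\sigma^2 \sim \mathrm{Gamma}(p, q)$ with $q$ a rate, and a single long chain with a burn-in is used instead of repeated cycles of short runs. The function name and default values are illustrative.

```python
import random

def gibbs_t_prior(b, d2, nu0=1.0, eta=0.0, C=12.0, p=0.2, q=0.001,
                  n_iter=4000, burn_in=1000, seed=0):
    """Posterior means of beta_i under a t (Cauchy when nu0 = 1) prior."""
    rng = random.Random(seed)
    k = len(b)
    beta = list(b)
    mu, tau2, sigma2 = sum(b) / k, 1.0, 1.0
    sums, kept = [0.0] * k, 0
    for it in range(n_iter):
        # beta_i | tau2, mu, b_i: precision-weighted average of b_i and mu
        for i in range(k):
            w = tau2 / (d2[i] + tau2)
            beta[i] = rng.gauss(w * b[i] + (1.0 - w) * mu,
                                (d2[i] * w) ** 0.5)
        # mu | beta, tau2: normal-normal update with prior N(eta, C)
        prec = k / tau2 + 1.0 / C
        mu = rng.gauss((sum(beta) / tau2 + eta / C) / prec, prec ** -0.5)
        # tau2 | beta, mu, sigma2 ~ IG((nu0 + k)/2, (nu0*sigma2 + SS)/2)
        ss = sum((x - mu) ** 2 for x in beta)
        tau2 = 0.5 * (nu0 * sigma2 + ss) / rng.gammavariate(0.5 * (nu0 + k), 1.0)
        # sigma2 | tau2 ~ Gamma(shape p + nu0/2, rate q + nu0/(2*tau2));
        # random.gammavariate takes a scale, hence the reciprocal
        sigma2 = rng.gammavariate(p + 0.5 * nu0,
                                  1.0 / (q + 0.5 * nu0 / tau2))
        if it >= burn_in:
            for i in range(k):
                sums[i] += beta[i]
            kept += 1
    return [s / kept for s in sums]
```

The retained draws of each $\beta_i$ could equally be stored to estimate quartiles or interval endpoints of the marginal posterior; only the running means are kept here to keep the sketch short.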

A  comparison  between  empirical  and  hierarchical 
Bayes  estimators  is  given  in  Kass  and  Steffey  (1989)  in 
which  approximations  of  the  posterior  variances  are  also 


+  Vo<T^  . 

C  - 

+  A;r2  + 

-) 

■</> 


fcT2+C'' 
given. Applying the approximation to the model given by equations
(2.1) and (2.3), one can see that the additional term needed to
take into account the estimation of the hyperparameters $\mu$ and
$A$ in (2.1) and $\mu$ and $\sigma$ in (2.3) is of order
$O(1/[n_i^2 k])$ while the main term is of order $O(n_i^{-1})$.
Consequently, equation (2.2) or the conditional variance based on
$\pi(\beta_i \mid \text{data}, A)$ is a good approximation of the
posterior variance when $n_i$ is relatively small and $k$ is
large. In this case the empirical Bayes estimators can serve as
an adequate approximation to those obtained from the hierarchical
model. However, the robust hierarchical Bayes model gives
estimates which are resistant to both misspecification of the
prior and outlying component estimates, as mentioned earlier.


2.3  Estimating toxoplasmosis prevalence rates

We now consider an example taken from Efron and Morris (1975), in
which the prevalence rates of toxoplasmosis in 36 El Salvadorian
cities were estimated. The prevalence rates in Table 1 are
standardized and the variances are known from the binomial
distribution, and differ because of the unequal number of
patients sampled in different cities. Table 1 gives the robust
empirical and hierarchical Bayes estimates, as well as the
empirical Bayes estimates developed in their paper as a
comparison. The maximum likelihood estimate via the
Newton-Raphson algorithm converged at $\mu = 0.024$ and
$A = 3.26$ after 55 iterations starting from the mean and
variance of the 36 prevalence rates. The Gibbs sampling algorithm
converged roughly with 160 cycles of $m = 50$ drawings in that
there was little change in the successive posterior distributions
thereafter at 200 and 240 cycles. In fact, the change in quartiles
of  the  posterior  distributions  was  less  than  10“®.  The 
initial values were $\eta_0 = -0.0419$, $C_0 = 12$, $p_0 = 0.2$,
and $q_0 = 0.001$, indicating rather vague prior knowledge about
these parameters. It seems in this example that the normal prior
is indeed quite robust, as the robust hierarchical Bayes
estimators and Efron and Morris’s empirical Bayes estimators are
very similar except for only a few cities in which the prevalence
rates are more at the extremes. This similarity is what is
expected of the (refined) robust hierarchical estimators. The
robust empirical Bayes estimates, however, are essentially the
same as the original estimated prevalence rates. This estimator
is probably too conservative in that the information between
cities did not add any new information about the prevalence
rates. In fact, $\min(A/d_i^2) = 655$ is quite large in this
case.


3  RATES IN LONGITUDINAL STUDIES


Often in longitudinal studies the rate of change of the response
over time is of primary interest, and such change is often
approximately linear (possibly after some transformations, and/or
over a short period of follow-up). For instance, the decline of
lung and renal functions is linear for certain patient
populations. In this case, it is reasonable to reduce the data to
slopes and their standard deviations by linear regression for
each individual (Hui and Berger, 1983). Because the subjects
under study share some common characteristics (belonging to a
certain population), it is reasonable to assume that their
individual rates of change come from a common probability
distribution. Consequently, shrinkage estimators of the
individual rates are desirable (Morris, 1983). This approach
ignores the intercept and thus loses some information in
comparison with the Gaussian random effects model (Laird and
Ware, 1982) or, more generally, a repeated measures model of
Jennrich and Schluchter (1986), which allows the modelling of
various within-subject correlation structures.

We now consider a prospective study in ophthalmology where
intraocular gas was used in complex retinal surgeries to provide
internal tamponade of retinal breaks in the eye. An important
issue was to estimate the kinetics (e.g., decay rate, half-life,
and so on) of the disappearance of the gas. After gas was
injected into their eyes, 31 patients were seen three to eight
(average of 5) times over a three-month period, and the volume of
the gas in their eyes was recorded.

Let $y_{ij}$ be the $j$th gas volume for the $i$th individual at
day $x_{ij}$. Some initial analysis suggested that the volume (in
percent) of the intraocular expansile gas ($C_3F_8$) decreases
slowly in the first few days after maximal expansion, then it
decreases more rapidly and finally more slowly (producing an
S-shaped curve). Thus a logit transformation was first made on
the gas volume:


Zij  —  log 


(  yo  +0.05  \ 
where 0.05 was added to avoid zero denominators. Then a linear
model can be assumed:

$$z_{ij} = \alpha_i + \beta_i x_{ij} + \epsilon_{ij},$$

where $\epsilon_{ij}$ is the Gaussian error term with mean 0 and
variance $\sigma_i^2$. Separate linear regressions using each
subject’s data are used to obtain the quantities:


$$b_i \sim N(\beta_i, d_i^2), \qquad d_i^2 = \frac{\sigma_i^2}{\sum_j (x_{ij} - \bar x_i)^2}, \qquad (3.1)$$

and

$$s_i^2 \sim \sigma_i^2 \chi^2_{n_i - 2},$$

where $d_i^2$ is the usual variance estimate of the slope $b_i$
and $\sigma_i^2$ is estimated from the $s_i^2$. Hui and Berger
(1983) also use the empirical Bayes estimates of the
$\sigma_i^2$’s as a compromise between $s_i^2/(n_i - 2)$, the
individual estimate, and $\sum s_i^2 / \sum n_i$, the pooled
estimate. Both estimates are independent of the $b_i$’s. However,
we only use the individual estimates of $\sigma_i^2$ to
illustrate the method.
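
The per-subject reduction to a slope and its variance can be sketched as follows, assuming ordinary least squares on one subject's transformed responses and the individual variance estimate $s_i^2/(n_i - 2)$, where $s_i^2$ is taken to be the residual sum of squares. The function name is hypothetical, and the logit transformation is assumed to have been applied beforehand.

```python
def slope_and_variance(x, z):
    """Least-squares slope b_i and its estimated variance d_i^2.

    x: observation days for one subject; z: transformed responses.
    Uses sigma_i^2 estimated by RSS / (n - 2), then d_i^2 = sigma_i^2 / Sxx.
    """
    n = len(x)
    if n != len(z) or n < 3:
        raise ValueError("need at least three paired observations")
    x_bar = sum(x) / n
    z_bar = sum(z) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0.0:
        raise ValueError("all x values are identical")
    sxz = sum((xi - x_bar) * (zi - z_bar) for xi, zi in zip(x, z))
    b = sxz / sxx
    a = z_bar - b * x_bar
    rss = sum((zi - a - b * xi) ** 2 for xi, zi in zip(x, z))
    sigma2_hat = rss / (n - 2)
    return b, sigma2_hat / sxx
```

Applying this to each subject yields the pairs $(b_i, d_i^2)$ that serve as input to the shrinkage estimators of the earlier sections.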

The goal is to find an improved estimator for each individual
decay rate to get a better idea of the variability of these
rates. Robustness considerations are particularly relevant here
because previous studies suggested the gas decay rate was highly
variable (Meyers et al., 1992). A Cauchy prior with median $\mu$
and quartiles $\mu \pm \sigma$ seems to be plausible. Further, a
normal hyperprior on $\mu$ and a gamma prior on $\sigma^2$ are
used. The initial values were given by $\eta_0 = -0.08$,
$C_0 = 12$, $p_0 = 0.2$, $q_0 = 0.0001$, indicating that rather
vague prior knowledge about these parameters was assumed.
Different starting values were used for different cycles
(iterations). The convergence was achieved roughly with 240
cycles of $m = 40$ drawings in that there was little change in
the histograms of the posterior distributions thereafter at 240,
300, 360, 400, and 420 cycles. The changes in quartiles
were  less  than  10“®.  The  decay  rates  were  estimated 
based on the data after 420 cycles of iterations. The robust
hierarchical model gives improved estimators of the decay rates
and their standard errors by borrowing strength from the ensemble
and thus provides a more accurate picture of the variation of the
individual gas decay rates. This can be more clearly shown by
looking at the plot of these rates over the cases (not shown
here). Table 2 gives the least-squares slopes, RHB estimates and
their standard errors, and a 90% confidence interval for each
individual decay rate.

In this data set, the decay rate for case 30 is outlying, being
beyond 1.5 times the interquartile range. We have found that the
RHB estimates are quite close to those obtained when the outlier
is removed (see Table 2). Thus our estimate is indeed quite
robust with respect to outlying rates.

Finally we fitted a random coefficient growth curve model. The
estimated best linear unbiased predictors (BLUPs) of the
individual rates of decline are obtained using SAS PROC MIXED.
Since these estimators are in fact shrinkage estimators of the
slopes using normal priors, they may not have the refined
robustness (with respect to outlying individual slopes) and could
give BLUP estimators which are more or less the same as the
original least-squares slopes. Thus the possible gain of using
the random effects model is diminished. This indeed appears to be
the case (see Table 2).

RHB = robust hierarchical Bayes estimates;

EB = empirical Bayes estimates from Efron and Morris (1975).


4  DISCUSSION

In summary, we used hierarchical modeling (through use of a
Cauchy prior in particular) to obtain estimators of normal means
(or random effects) which are robust with respect to prior
misspecifications and outlying individual means. Thus the gain of
shrinkage is preserved. It is worth pointing out that the same
problem with Gaussian priors persists in the linear mixed-effects
models of Laird and Ware (1982) for analyzing longitudinal data.
The effect of assuming a Gaussian random effect is that the
potential advantage of a random-effects model may vanish simply
because of one outlying individual’s random effects. It is,
however, easy to incorporate the robust priors studied in this
paper into these models if the computation is performed using the
Gibbs sampler as in Gilks et al. (1993).


Abstract

Recursive partitioning methods, also known as tree or CART
models, have been applied to several kinds of data, including the
cases where the response y is a continuous variable, a category
or class, a survival time, and a longitudinal response pattern.
In this work we extend the methods to the prediction of an
observed response rate (number of events)/(time observed). The
building and ordering of a tree model work well, but there are
some open issues in cross-validation of the final model. Finally,
some connections are noted to other work on trees for survival
data.


1  Introduction 

Recursive partitioning is a method for growing binary decision
trees, where each node or split represents a decision, e.g., go
to the left if age < 40, and the terminal leaves give the
predicted values. These methods date back to the AID (Automatic
Interaction Detection) program developed by Morgan and Sonquist
in the early 1960s, and received a strong theoretical boost with
the CART (Classification and Regression Trees) work of Breiman et
al. in the 1980s [1]. A famous example is the digit recognition
problem.

Consider the segments of an unreliable digital readout
where each light is correct with probability 0.9, e.g., if
the true digit is a 2, the lights 1, 3, 4, 5, and 7 are on with
probability 0.9 and lights 2 and 6 are on with probability
0.1. Construct test data where $Y \in \{0, 1, \ldots, 9\}$, each
with proportion 1/10, and the $X_i$, $i = 1, \ldots, 7$ are independent
Bernoulli variables with parameters depending on $Y$. $X_8$ to
$X_{24}$ are generated as i.i.d. Bernoulli with $P\{X_i = 1\} = .5$, and
are independent of $Y$. They correspond to embedding
the readout in a larger rectangle of random lights. A
sample of size 200 was generated accordingly and the
CART procedure applied to build the tree. The results
are shown in figure 1.

Figure 1: Optimally pruned tree for the stochastic digit
recognition data
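The simulated data described above can be sketched as follows. This is a minimal illustration, not the code used for the paper: the segment numbering (1 top, 2 upper left, 3 upper right, 4 middle, 5 lower left, 6 lower right, 7 bottom) is an assumption chosen to agree with the pattern given for the digit 2, the names `SEGMENTS`, `true_pattern` and `simulate_digits` are hypothetical, and the classes are drawn uniformly at random rather than in exact proportions of 1/10.

```python
import numpy as np

# Lit segments per digit under the ASSUMED numbering:
# 1 top, 2 upper left, 3 upper right, 4 middle,
# 5 lower left, 6 lower right, 7 bottom.
SEGMENTS = {
    0: (1, 2, 3, 5, 6, 7), 1: (3, 6), 2: (1, 3, 4, 5, 7),
    3: (1, 3, 4, 6, 7), 4: (2, 3, 4, 6), 5: (1, 2, 4, 6, 7),
    6: (1, 2, 4, 5, 6, 7), 7: (1, 3, 6),
    8: (1, 2, 3, 4, 5, 6, 7), 9: (1, 2, 3, 4, 6, 7),
}

def true_pattern(digit):
    """Error-free 0/1 vector of the seven lights for one digit."""
    pattern = np.zeros(7, dtype=int)
    pattern[[s - 1 for s in SEGMENTS[digit]]] = 1
    return pattern

def simulate_digits(n=200, p_correct=0.9, n_noise=17, seed=0):
    """Return (y, X): classes and 7 noisy lights plus pure-noise lights."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 10, size=n)             # class, uniform on 0..9
    truth = np.array([true_pattern(d) for d in y])
    flip = rng.random((n, 7)) > p_correct       # each light wrong w.p. 0.1
    lights = np.where(flip, 1 - truth, truth)
    noise = rng.integers(0, 2, size=(n, n_noise))  # Bernoulli(.5), independent of y
    return y, np.hstack([lights, noise])

y, X = simulate_digits()
```

The resulting `X` has 24 columns, of which only the first seven carry information about the class.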

Tree  methods  have  been  applied  to  regression  and 
classification  problems  [1],  survival  analysis  [3],  longi¬ 
tudinal  analysis  [6]  and  others.  The  goal  of  this  research 
is  to  extend  the  methodology  to  event  rate  data.  The 
model  in  this  case  is 


$$\lambda = f(x)$$

where $\lambda$ is an event rate and $x$ is some set of predictors.
As an example consider hip fracture rates. For
each county in the United States we can obtain

•  number of fractures in patients age 65 or greater
(from Medicare files)

•  population  of  the  county  (US  census  data) 

•  potential  predictors  such  as 

-  socio-economic  indicators 

-  number of days below freezing

-  ethnic  mix 

-  physicians/1000  population 

-  etc. 

Such  data  would  usually  be  approached  by  using  Pois¬ 
son  regression;  can  we  find  a  tree  based  analogue? 

2  Recursive partitioning ingredients

A  tree  based  method  has  four  main  ingredients 

1.  A split criterion. This is used to determine the "best"
available split of a node into two daughter nodes.

2.  An impurity criterion. This is used to measure the
"homogeneity" of a node, and is used to order
the possible sub-trees (sub-models) of the full tree
model.

3.  Labeling:  An  “average  response”  for  each  node. 

4.  Prediction  error:  The  error  in  prediction  for  a  new 
observation,  should  it  be  predicted  using  this  node. 
This  is  needed  for  cross-validation  but  not  for  build¬ 
ing  or  ordering  the  tree. 

For  tree  based  regression,  these  are 

1.  the  between  groups  sum  of  squares, 

2.  the  within  node  sum  of  squares, 

3.  the  mean  and  variance  of  a  node, 

4.  $(y - \bar{y})^2$.

For  tree  based  classification  there  are  several  variations. 

Choices  include 

1.  One  of 

•  the likelihood ratio test for $H_0: p_1 = p_2$, where
$p_1$ and $p_2$ are the vectors of proportions in the
two daughter nodes.

•  the  Gini  criterion 


•  the  twoing  criterion  (see  [1]) 

2.  One  of 

•  the  binomial  deviance  within  the  node 

•  the  risk  of  a  node,  based  on  priors  and  a  loss 
matrix 

3.  The  predicted  class  for  the  node,  or  the  vector  of 
class  probabilities 

4.  One  of 

•  the prediction loss $L$(observed class, predicted
class), where $L$ is the loss matrix

•  the  predicted  contribution  to  the  deviance. 

(Many other choices have been explored for this problem.)

In  adding  criteria  for  rates  regression  to  this  ensem¬ 
ble,  the  guiding  principle  was  the  following:  the  between 
groups  sum-of-squares  is  not  a  very  robust  measure,  yet 
tree  based  regression  works  very  well.  So  do  the  simplest 
thing  possible. 

Let $c_i$ be the observed event count for observation $i$,
$t_i$ be the observation time, and $x_{ij}$, $j = 1, \ldots, p$ be the
predictors.

Labels: The observed event rate and the within-node
deviance

$$\hat\lambda = \frac{\#\,\text{events}}{\text{total time}} = \frac{\sum c_i}{\sum t_i}$$

Splitting rule: The likelihood ratio test for two Poisson
groups

$$D_{\text{parent}} - (D_{\text{left son}} + D_{\text{right son}})$$

where $D$ denotes the within-node deviance.

Purity: The within node deviance.

Prediction: The deviance contribution for a new observation,
using $\hat\lambda$ of the node as the predicted rate.
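The rate criteria can be sketched in a few lines. This is an illustration under stated assumptions rather than the rpart source: the function names are hypothetical, and the within-node deviance is taken to be the usual Poisson deviance, 2 Σ [c_i log(c_i / (λ t_i)) - (c_i - λ t_i)], with the convention 0 log 0 = 0.

```python
import numpy as np

def node_rate(c, t):
    """Observed event rate of a node: total events / total time."""
    return np.sum(c) / np.sum(t)

def poisson_deviance(c, t, rate):
    """Poisson deviance of counts c over times t at a given rate."""
    c = np.asarray(c, dtype=float)
    t = np.asarray(t, dtype=float)
    mu = rate * t
    # Suppress warnings from log(0) and 0/0; those entries are handled
    # explicitly by the np.where below (0 * log 0 is taken as 0).
    with np.errstate(divide="ignore", invalid="ignore"):
        term = np.where(c > 0, c * np.log(c / mu), 0.0)
    return 2.0 * np.sum(term - (c - mu))

def split_improvement(c, t, left):
    """Parent deviance minus the sum of the two daughter deviances."""
    c = np.asarray(c, dtype=float)
    t = np.asarray(t, dtype=float)
    left = np.asarray(left, dtype=bool)

    def dev(mask):
        return poisson_deviance(c[mask], t[mask], node_rate(c[mask], t[mask]))

    everyone = np.ones(len(c), dtype=bool)
    return dev(everyone) - (dev(left) + dev(~left))

# Hypothetical node: six subjects, one unit of time each.
counts = np.array([0, 1, 2, 5, 6, 7])
times = np.ones(6)
gain = split_improvement(counts, times, [True, True, True, False, False, False])
```

A candidate split is scored by `split_improvement`; the best split of a node is the one with the largest value. Note that `poisson_deviance` returns infinity when the predicted rate is zero and an event is observed, which is the edge effect taken up in the next section.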

3  Improving  the  method 

There is a problem with the criterion just proposed, however:
cross-validation of a model often produces an infinite
value for the deviance. The simplest case where
this occurs is easy to understand. Assume that some
terminal node of the tree has 20 subjects, but only 1 of
the 20 has experienced any events. When that subject is the
one held out, the remaining 19 give an estimated rate of zero,
and the cross-validated error (deviance) estimate for that
node will be

$$\ldots + c_i \log(c_i / (0 \cdot t_i)) + \ldots$$

which is infinite for $c_i > 0$. The problem is that when
$\hat\lambda = 0$ the occurrence of an event is infinitely improbable,
and, using the deviance measure, the corresponding
model is infinitely bad.

One  might  expect  this  phenomenon  to  be  fairly  rare, 
but  unfortunately  it  is  not  so.  One  given  of  tree-based 
modeling  is  that  a  right-sized  model  is  arrived  at  by  pur¬ 
posely  overfitting  the  data  and  then  pruning  back  the 
branches.  A  program  that  aborts  due  to  a  numeric  ex¬ 
ception  during  the  first  stage  is  embarrassing  to  say  the 
least. 

Of more concern is that this edge effect does not seem
to be limited to the pathological case detailed above. Any
near approach to the boundary value $\lambda = 0$ leads to
large values of the deviance, and the procedure tends to
discourage any final node with a small number of events.

An ad hoc solution is to use the revised estimate

$$\hat{\hat\lambda} = \max\left(\hat\lambda, \frac{k}{\sum t_i}\right)$$

where $k$ is 1/2 or 1/6. This is similar to the starting
estimates used in the GLM program for a Poisson regression.
This is unsatisfying, however, and we propose
instead using a shrinkage estimate.

Assume that the true rates for the leaves of the
tree are random values from a Gamma($\mu$, $\sigma$) distribution.
Set $\mu$ to the observed overall event rate $\sum c_i / \sum t_i$, and
let the user choose as a prior the coefficient of variation
$k = \sigma/\mu$. A value of $k = 0$ represents extreme pessimism
("the leaf nodes will all give the same result"), whereas
$k = \infty$ represents extreme optimism. The Bayes estimate
of the event rate for a node works out to be

$$\hat\lambda_k = \frac{\alpha + \sum c_i}{\beta + \sum t_i}$$

where $\alpha = 1/k^2$ and $\beta = \alpha/\hat\lambda$, with $\hat\lambda$ here the
overall event rate and the sums taken over the node.

This estimate is scale invariant, has a simple interpretation,
and shrinks least those nodes with a large amount
of information. In practice, a value of $k = 10$ does
essentially no shrinkage. All tests were done with $k = 1$.
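As a small worked sketch of the shrinkage estimate, assume the standard Gamma-Poisson posterior mean, (α + Σc_i)/(β + Σt_i), with α = 1/k² and β = α divided by the overall rate. The function name and the numbers below (19 held-in subjects with one unit of time each, an overall rate of 0.05) are hypothetical and chosen only for illustration.

```python
def shrunken_rate(events, time, overall_rate, k=1.0):
    """Posterior-mean event rate for a node under a Gamma prior.

    events, time  : total events and total observation time in the node
    overall_rate  : prior mean, taken as the overall observed event rate
    k             : prior coefficient of variation (must be > 0 and finite)
    """
    alpha = 1.0 / k**2
    beta = alpha / overall_rate
    return (alpha + events) / (beta + time)

# A leaf whose held-in subjects show no events: the raw rate is 0,
# but the shrunken rate is strictly positive, so the deviance of a
# held-out subject with events stays finite.
rate_k1 = shrunken_rate(events=0, time=19.0, overall_rate=0.05, k=1.0)

# With k = 10 the estimate is close to the raw rate 5 / 10 = 0.5.
rate_k10 = shrunken_rate(events=5, time=10.0, overall_rate=0.05, k=10.0)
```

With k = 1 the prior carries the weight of one event observed over 1/0.05 = 20 units of time, which is why small nodes are pulled strongly toward the overall rate while large nodes are barely changed.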


4  Examples 

As an example, we consider a variant of the digit recognition
problem. Let $X_1$ to $X_7$ be the segments of a digital
readout, as in the earlier example, where each segment
is in error 20% of the time. Let $U_1$ to $U_{10}$ and $B_1$ to
$B_{10}$ be extraneous predictors with uniform(0,1) and
binomial(.5) distributions, respectively. The true class of
the observations is evenly divided over the digits 0-9,
but the true class is not observed. Instead we observe
a Poisson count with rate $\lambda = .34$ for class 0 and rate
$\lambda = 3.4$ for class 9; the true rates are evenly spaced on a
logarithmic scale. The number of observations and the
total time on test were varied between simulations.

Figure 2: Rates recognition

A typical tree for $n = 1000$ and $t_i \sim U(.5, 1.5)$ is
shown in figure 2. With this choice for $n$ and $t$ there were
on average 1000 events, which is a fairly large sample.

Each internal node of the tree is labeled with the variable
used to split at that node. The nodes marked with
a double asterisk are retained if one uses the minimum
cross-validated error rule, and those with an asterisk are
retained if the "1 SE" rule is used. Each leaf is labeled
with the class(es) that would be routed to that leaf if the $X_i$
were measured without error; for some of the leaves we
also show the next variable that was chosen by the splitting
rule (although the split was not retained). In ten
independent runs of this simulation, the same qualitative
results were obtained.

First, this is a hard problem. A plot (not shown) of
the observed event rates $c_i/t_i$ versus the class shows
considerable overlap. Classes 1-3 were never well resolved,
and the high error rate for the true predictors makes
deep trees difficult for this sample size.

Secondly, even with shrinkage the cross-validation criterion
seems to recommend trees that are too small. The
'best' tree, i.e., the one with lowest cross-validation error,
sometimes missed informative splits, such as the split on
X^  at  the  bottom  of  figure  2  (it  also  sometimes  included 
an uninformative split). The '1 SE' rule, however,
consistently trimmed off 1-2 informative splits from the best
tree.

Third, the method is asymptotically consistent.
When the average time of observation $t_i$ was increased
to 10, keeping the same event rates (so we have 10 times
the information), a perfect model was always found. The
best tree was the same as the 1-SE tree, and was based
on 9 informative splits.

Further  research  needs  to  be  done  with  this  example, 
including 

•  other  values  of  the  shrinkage  parameter  k 

•  the  effect  of  increasing  observation  time  per  subject, 
versus  increasing  the  number  of  subjects. 

•  shrinking  trees,  as  in  Hastie  [2] 

•  other  measures  of  prediction  error 

One other measure of prediction error was examined
briefly. We know that in the multinomial classification
problem the same edge effect can occur when the deviance
is used as the error measure and an observed rate
is near zero or one. This can be ameliorated by using
the simple sums of squares error $\|p_i - \hat{p}\|^2$, where $\hat{p}$ is
the predicted probability vector for a node and $p_i$ is the
observed vector for a subject (zeros with a single 1). By
analogy, we might expect $(c_i/t_i - \hat\lambda)^2$ to avoid some of
the problems with skewness associated with the Poisson
deviance measure. Sadly, this did not hold true.

5  Relation  to  other  work 

One obvious use of this software is for survival data.
The censoring indicator $\delta = 0, 1$ becomes the number of
events for a subject, and the follow-up time is used as the
time on test. In this case the likelihood ratio test for two
Poisson subsamples is equivalent to the likelihood ratio
test for two exponentials, and our splitting rule is the
one proposed by Davis [4]. He also noticed the problem
with nodes that have only a few events, leading to an
infinite estimate of cross-validated error, and proposed
an ad hoc shrinkage estimate for $\lambda$. His final suggestion
is to use the cross-validation results only as a guide to
choosing the right tree.

LeBlanc  and  Crowley  [5]  also  consider  the  case  of  sur¬ 
vival  data,  but  base  their  splitting  rule  on  the  local  full 
likelihood.  This  procedure  is  equivalent  to  the  following: 

•  Rescale the time values within the node so that the
cumulative hazard is linear, i.e., replace each $t_i$ with
$H(t_i)$, where $H$ is a piecewise linear estimate of the
cumulative hazard.

•  Use  the  usual  exponential  deviance  statistic,  but 
with  the  rescaled  time  values 


As  a  practical  matter,  they  suggest  only  rescaling  the 
data  once,  at  the  first  split.  Thus,  our  procedure  can 
mimic  theirs  simply  by  prescaling  the  data  before  calling 
the  routine. 

6  Software 

A standalone program that implements this technique
is available from statlib. Send the message
"send rpart from general" to the fictitious user
statlib@lib.stat.cmu.edu. The routine can also handle
categorical data using the Gini criterion and regression
problems using the between groups sum of squares.

A set of S functions for the same task should be submitted
to statlib soon (some documentation is unfinished).
People who wish to try out an early release can
send mail to the author at therneau@mayo.edu.

Abstract

A regression model is estimated nonparametrically
using regression splines to model nonlinear
components, with the dependent variable
transformed using a Box-Cox transformation.
The knots for each component, the regression
variables and the data transformation are
selected using a Bayesian approach with the
computation carried out using the Gibbs sampler.
This extends previous work on Bayesian variable
selection which assumes that variables enter
linearly. The proposed nonparametric estimator
is applied to a number of examples and shown
to work well in practice. By exploiting the
special features of a spike and slab prior for the
regression coefficients, our variable selection
algorithm is much faster than previous Bayesian
variable selection algorithms.

1  Introduction 

We estimate a regression model semiparametrically
using cubic regression splines to model
nonlinear components. In this paper we confine
the  discussion  to  additive  regression  models  but 
the  approach  extends  in  a  straightforward  way 
to  a  regression  model  with  interactions.  We 
conjecture  that  most  nonlinear  regressors  ob¬ 
served  in  practice  are  well  approximated  by  a 
regression  spline  with  just  a  few  knots,  if  those 
knots are carefully selected. If too many knots
are used to estimate a nonlinear function which
is  observed  with  noise  then  a  poor  smooth  with 
high  local  variance  can  result.  Because,  in  gen¬ 
eral,  we  do  not  know  how  to  optimally  place 
the  knots  for  each  variable,  we  use  many  knots 
for  each  variable  and  select  the  important  knots 
using  Bayesian  variable  selection.  We  note  that 
our  approach  selects  which  independent  vari¬ 
ables  enter  the  regression  and  so  extends  pre¬ 
vious  work  on  variable  selection  in  linear  re¬ 
gression  by  Mitchell  and  Beauchamp  (1988)  and 


George  and  McCulloch  (1993,  1994).  We  also 
allow  the  dependent  variable  to  be  transformed 
using  a  Box-Cox  transformation  taking  a  dis¬ 
crete  number  of  values. 

We  show  that  our  procedure  works  well  on  a 
number  of  simulated  examples.  In  the  one  di¬ 
mensional  case  we  compare  the  nonparametric 
smooth  obtained  by  Bayesian  variable  selection 
with  that  obtained  by  the  kernel  based  locally 
linear least squares smoother, with the bandwidth
parameter estimated by the direct plug-in
procedure developed by Ruppert, Sheather and
Wand (1993). This plug-in estimator is among
the  best  performing  bandwidth  estimators  for 
locally  linear  least  squares  kernel  regression. 

Because  of  the  large  number  of  variables  in¬ 
volved,  the  computation  is  carried  out  using  the 
Gibbs  sampler  with  the  error  variance,  the  re¬ 
gression  parameters  and  the  Box-Cox  parame¬ 
ter  integrated  out.  We  place  a  slab  and  spike 
prior  on  the  regression  parameters  and  exploit 
this  prior  to  obtain  a  fast  Bayesian  variable  se¬ 
lection  algorithm.  When  the  number  of  vari¬ 
ables  selected  is  substantially  smaller  than  the 
number  available,  which  is  almost  always  the 
case  in  our  applications,  then  our  approach  can 
be  substantially  faster  than  that  proposed  by 
George  and  McCulloch  (1994)  who  also  inte¬ 
grate  out  the  error  variance  and  the  regres¬ 
sion  parameters.  A  more  detailed  comparison 
of  our  approach  with  that  of  George  and  Mc¬ 
Culloch  (1993,  1994)  is  given  in  Section  7. 

Our  approach  to  nonparametric  regression 
has  a  number  of  advantages  over  previous  work. 
First,  we  just  use  a  linear  regression  frame¬ 
work  which  is  easy  to  understand  and  allows  the 
usual  linear  regression  diagnostics  to  be  carried 
out  after  the  model  is  estimated.  Most  opti¬ 
mal  nonparametric  regression  estimators  such 
as  splines  and  kernel  based  nonparametric  esti¬ 
mators  are  quite  esoteric  to  the  general  user,  es¬ 
pecially  when  smoothing  parameters  need  to  be 
estimated as well. Second, our approach is very
general and can handle additive models with
interaction  terms  and  can  select  the  significant 
independent  variables.  At  present,  kernel  based 
methods  cannot  handle  additive  models  when 
reliable  bandwidth  estimation  is  also  required. 
There  do  not  seem  to  be  at  present  reliable  ways 
of  doing  variable  selection  using  spline  smooth¬ 
ing  with  the  exception  of  some  ad-hoc  meth¬ 
ods  such  as  the  Bruto  algorithm  proposed  in 
Hastie  and  Tibshirani  (1990,  p.  262).  Friedman 
and  Silverman  (1989)  and  Friedman  (1991)  also 
use  regression  splines  for  nonparametric  regres¬ 
sion  and  select  the  knots  by  a  cross-validation 
procedure. This is computationally very intensive
and makes it difficult to traverse all
possible knot combinations when seeking optimal
knot allocation. Hastie (1989) notes that
the  knot  selection  procedure  in  Friedman  and 
Silverman  (1989)  can  produce  unsatisfactory 
model  fits.  A  third  advantage  of  our  procedure 
is  that  it  is  very  fast  compared  to  many  other 
nonparametric regression estimators. Except
for an initial $O(n)$ calculation, the cost of our
procedure is independent of sample size. Spline
smoothing using either generalised cross-validation or
marginal likelihood to estimate the smoothing
parameter generally requires $O(n^3)$ operations,
e.g. Gu and Wahba (1991), with some savings
available for specialised models. Kernel based
nonparametric regression requires $O(n^2)$ operations
but can be considerably speeded up by
using binning as in Fan and Marron (1994).
Finally, our approach allows the dependent variable
to be transformed as an integral part of the
estimation. This can only be done on an ad-hoc
basis using spline or kernel fitting.

The  paper  is  structured  as  follows.  Section  2 
describes  variable  selection  for  linear  regression 
and  explains  how  the  Gibbs  sampler  is  used  to 
find  the  model  with  the  highest  posterior  prob¬ 
ability.  Section  3  presents  our  approach  to  non¬ 
parametric  regression  in  the  univariate  case  and 
empirically  compares  its  performance  to  ker¬ 
nel  based  locally  linear  least  squares  smooth¬ 
ing.  Section  4  generalises  the  treatment  in  Sec¬ 
tion  2  to  include  transformation  of  the  depen¬ 
dent  variable  as  part  of  the  Bayesian  analysis. 


Section  5  deals  with  semiparametric  additive  re¬ 
gression.  Section  6  gives  implementation  de¬ 
tails  for  variable  selection  and  transformation 
of  the  dependent  variable  in  a  linear  regression 
model.  Section  7  compares  our  approach  to 
variable  selection  with  that  of  George  and  Mc¬ 
Culloch  (1993,  1994). 

2  Variable  selection  in  a  linear 
regression  model 

In  this  section  we  review  variable  selection  in 
the  linear  regression  model  as  it  is  the  basis  of 
our  nonparametric  procedure.  We  consider  the 
linear  regression  model 

$$y = X\beta + e \qquad (2.1)$$

where $y$ is the $n \times 1$ vector of observations, $X$
is the $n \times r$ design matrix, $e \sim N(0, \sigma^2 I_n)$ is
the error vector and $\beta = (\beta_1, \ldots, \beta_r)'$ is the
$r \times 1$ vector of regression coefficients. Let $\gamma$
be the $r \times 1$ vector of indicator variables with
$i$th element $\gamma_i$ such that $\gamma_i = 0$ means that
$\beta_i = 0$ and $\gamma_i = 1$ means that $\beta_i \neq 0$. Given $\gamma$,
let $\beta_\gamma$ consist of all the nonzero elements of $\beta$
and let $X_\gamma$ be the columns of $X$ corresponding
to those elements of $\gamma$ that are equal to one.
Given $\gamma$ and $\sigma^2$, we take the prior for $\beta_\gamma$ as
$\beta_\gamma \sim N(0, c\sigma^2 (X_\gamma' X_\gamma)^{-1})$, where $c$ is a
positive scale factor specified by the user. In
the empirical work we take $c = 100$ and find
it performs well and makes the prior $p(\beta_\gamma | \gamma, \sigma^2)$
almost diffuse. We take the prior of $\sigma^2$ given
$\gamma$ as $p(\sigma^2 | \gamma) \propto 1/\sigma^2$. Finally, we take the $\gamma_i$
as a priori independent with $p(\gamma_i = 1) = \pi_i$,
$0 < \pi_i < 1$, for $i = 1, \ldots, r$. In our applications
we take the $\pi_i = \tfrac12$, which means that
each model $\gamma$ has a prior probability equal to
$2^{-r}$. Taking the $\pi_i$ smaller than $\tfrac12$ will result
in a more parsimonious model. Our aim in this
paper is to select the model with the highest
posterior probability, that is the highest value
of $p(\gamma | y)$. This is equivalent to maximising
$p(y | \gamma) p(\gamma)$. By integrating $\beta_\gamma$ and $\sigma^2$ out we
obtain that

$$p(y | \gamma) \propto (1 + c)^{-q_\gamma/2} S(\gamma)^{-n/2} \qquad (2.2)$$

where $q_\gamma = \sum_{i=1}^{r} \gamma_i$ is the number of nonzero
elements of $\beta$ and

$$S(\gamma) = y'y - \frac{c}{1+c}\, y' X_\gamma (X_\gamma' X_\gamma)^{-1} X_\gamma' y \qquad (2.3)$$

so that

$$p(\gamma | y) \propto (1 + c)^{-q_\gamma/2} S(\gamma)^{-n/2} \prod_{i=1}^{r} \pi_i^{\gamma_i} (1 - \pi_i)^{1 - \gamma_i}$$

To obtain the model with the highest posterior
probability it is necessary to search over $2^r$
models. This can be done directly if $r$ is small.
In our applications $r$ will usually be large so
a direct search is not feasible and we use the
Gibbs sampler (Gelfand and Smith, 1990) to
traverse the parameter space. Our use of the
Gibbs sampler can be described as follows.

Gibbs sampler (i) Choose an initial value
$\gamma^{(0)} = (\gamma_1^{(0)}, \ldots, \gamma_r^{(0)})$ of $\gamma$, perhaps by generating
it from some distribution. (ii) Successively
generate $\gamma_i$ from $p(\gamma_i | y, \gamma_{j \neq i})$, $i = 1, \ldots, r$. Step (ii)
is carried out many times and in two stages. The first
stage is a warmup period at the end of which
it is assumed that the sampler has converged
to the joint posterior distribution $p(\gamma | y)$. The second
stage is a sampling period and the $\gamma_i$ collected
during this period are used for inference.

We note that as the $\gamma_i$ are generated, the posterior
probability $p(\gamma | y)$ is also calculated (up to
a constant independent of $\gamma$) so that of the models
generated thus far the one with the highest
posterior probability can be recorded.

The Gibbs sampler can be executed very efficiently
because usually $q_\gamma$ will be much smaller
than $r$ in our problems. Implementation details
are given in Section 6.
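The sampler of this section can be sketched as below. This is a deliberately naive version, offered under assumptions: it takes the marginal likelihood to have the usual g-prior form, (1 + c)^(-q/2) S(γ)^(-n/2) with S(γ) = y'y - c/(1 + c) y'X_γ(X_γ'X_γ)^(-1)X_γ'y, it recomputes that quantity from scratch at every step rather than using the fast updating of Section 6, and the names `log_post` and `gibbs_select` are hypothetical.

```python
import numpy as np

def log_post(gamma, X, y, c=100.0, pi=0.5):
    """log p(gamma | y) up to an additive constant."""
    n, r = len(y), len(gamma)
    q = int(gamma.sum())
    S = y @ y
    if q > 0:
        Xg = X[:, gamma.astype(bool)]
        b = Xg.T @ y
        S = S - c / (1.0 + c) * (b @ np.linalg.solve(Xg.T @ Xg, b))
    log_prior = q * np.log(pi) + (r - q) * np.log(1.0 - pi)
    return -0.5 * q * np.log1p(c) - 0.5 * n * np.log(S) + log_prior

def gibbs_select(X, y, n_warmup=300, n_sample=3000, seed=0, **kw):
    """Sweep over the indicators; return the best model visited."""
    rng = np.random.default_rng(seed)
    r = X.shape[1]
    gamma = np.zeros(r, dtype=int)              # arbitrary starting value
    best, best_lp = gamma.copy(), log_post(gamma, X, y, **kw)
    for _ in range(n_warmup + n_sample):
        for i in range(r):
            lp = np.empty(2)
            for v in (0, 1):                    # score both states of gamma_i
                gamma[i] = v
                lp[v] = log_post(gamma, X, y, **kw)
            # P(gamma_i = 1 | y, other indicators), computed stably
            p1 = 1.0 / (1.0 + np.exp(np.clip(lp[0] - lp[1], -700, 700)))
            gamma[i] = int(rng.random() < p1)
            if lp[gamma[i]] > best_lp:          # track the posterior mode
                best, best_lp = gamma.copy(), lp[gamma[i]]
    return best, best_lp
```

Because only the best model visited is recorded, the distinction between warmup and sampling iterations does not matter in this sketch; it would matter if the draws themselves were used for inference.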

3  Univariate  nonparametric 
regression 

Suppose that

$$y_i = f(x_i) + e_i, \quad i = 1, \ldots, n \qquad (3.1)$$

where $y_i$ is the $i$th observation, $e_i$ is an independent
$N(0, \sigma^2)$ error sequence and $f(x)$ is a
smooth function. We propose to approximate
$f(x)$ by the cubic regression spline

$$b_0 + b_1 x + b_2 x^2 + b_3 x^3 + \sum_{k=1}^{m} \beta_k (x - \tilde{x}_k)_+^3 \qquad (3.2)$$

where $\tilde{x}_1, \ldots, \tilde{x}_m$ are the $m$ 'knots' placed along
the domain of the independent variable $x$, such
that $\min(x_i) < \tilde{x}_1 < \ldots < \tilde{x}_m < \max(x_i)$, while
$(z)_+ = \max(0, z)$. By replacing $f(x)$ in (3.1) by
its approximation (3.2) the nonparametric regression
can be rewritten as a linear regression.
Let $r = m + 4$, $\beta = (b_0, b_1, b_2, b_3, \beta_1, \ldots, \beta_m)'$,
$x = (x_1, \ldots, x_n)'$ and let $1$ be a vector of
$n$ 1's. Also, let the $n \times r$ matrix $X =
(1, x, x^2, x^3, (x - 1\tilde{x}_1)_+^3, \ldots, (x - 1\tilde{x}_m)_+^3)$.

Then, with $f(x)$ replaced by (3.2), we can write
(3.1) as (2.1).
The most important question associated
with fitting regression splines is the choice of
both the number and location of the knots
$\tilde{x}_1, \ldots, \tilde{x}_m$; see, for example, Friedman and
Silverman (1989) and Friedman (1991). If the
knots are badly located, details of the curve can
be missed, while if too many knots are included
the fitted spline based on these knots will have
high local variance. One way to solve the problem
is to introduce a large number of potential knots
from which a significant subset can be selected,
e.g. Friedman and Silverman (1989, pp. 9-11).
The problem then becomes one of variable selection
where each knot corresponds to a column
of a design matrix from which a significant subset
is to be determined. Although the number
of potential knots, $m$, will typically be large so
that $r$ will be large, the number of significant
variables $q$ required to obtain a good approximation
will usually be quite small. This is what
makes our algorithm so fast.

We  look  at  the  performance  of  our  approach 
and  compare  it  to  local  linear  smoothing  for 
data  sets  generated  from  the  following  three 
curves. 

$$y_i = 2x_i + e_i \qquad (3.3)$$

where $e_i \sim$ iid $N(0, 0.5^2)$,

$$y_i = \sin(8\pi x_i) + e_i \qquad (3.4)$$

where $e_i \sim$ iid $N(0, 0.5^2)$, and

$$y_i = g(x_i) + e_i \qquad (3.5)$$

where $g(x) = 10e^{-10x} + 2$ if $x < \tfrac12$ and
$g(x) = 3\cos(10\pi x)$ if $x \geq \tfrac12$, with
errors $e_i \sim$ iid $N(0, 2^2)$. One hundred observations
were drawn from a Uniform(0,1) distribution,
forming the independent variable for each
of the three functions. The errors were also randomly
generated, while the knots were chosen to
follow the density of the independent variable,
one every three observations. This produced a
total of $m = 33$ knots and $r = 37$ columns in $X$
from which to select. The Gibbs sampler was
run for a warmup period of 300 iterations and
a sampling period of 3000 iterations, with arbitrary
initial condition $\gamma^{(0)} = (1, 0, \ldots, 1, 0, 1)'$.
Convergence seems to have occurred within a
dozen iterations for each of the three functions.
When the variables selected by the Bayesian
approach were placed in a linear least squares
routine they were all significant at the 1% level.
Figures 1(a)-1(c) show plots of the least squares
fits, based on the obtained model estimates,
against each set of generated data and respective
true curve. Figures 1(d)-(f) show the corresponding
fits obtained to the same data sets using
local linear kernel based regression. Smith
and Kohn (1994) repeat the above simulation
100 times and show that the three data sets generated
are typical data sets for the models (3.3)-(3.5).
The six plots in Figure 1 show that the
regression spline estimator performs well and is
smoother than the local linear estimator. This
has also been our experience with other data
sets.

A  more  extensive  set  of  simulations  and  com¬ 
parisons  with  locally  linear  least  squares  is  given 
by  Smith  and  Kohn  (1994). 

4  Data  transformation 

We now generalise the model (2.1) by allowing
the dependent variable to be transformed using
a Box-Cox transformation. Given the indicator
vector $\gamma$, the linear model becomes

$$y_\lambda = X_\gamma \beta_\gamma + e \qquad (4.1)$$

where $y_{i,\lambda} = y_i^\lambda$ if $\lambda \neq 0$ and $y_{i,\lambda} = \log(y_i)$ if
$\lambda = 0$. As is normal when using the Box-Cox
transformation, we assume that the dependent
variable $y_i$ is positive. Otherwise, some positive
number is added to all the observations to make
this so. In order to carry out both variable selection
and transformation selection using the
Gibbs sampler it will be necessary to integrate
out $\lambda$. To facilitate this we allow $\lambda$ to take on
just a small set of values denoted by $\Lambda$. In our
examples we take $\Lambda = \{-2, -1, -\tfrac12, 0, \tfrac12, 1, 2\}$,
which will be adequate for most applications.
Our aim is to find the values of $\lambda$ and $\gamma$ that
give the highest posterior probability $p(\lambda, \gamma | y)$.
To find this combination of $\lambda$ and $\gamma$ we run
the Gibbs sampler as in Section 2 by generating
from $p(\gamma_i | y, \gamma_{j \neq i})$, $i = 1, \ldots, r$. To evaluate
$p(\gamma | y)$ we note that

$$p(\gamma | y) = \sum_{\lambda \in \Lambda} p(\lambda, \gamma | y) \propto \sum_{\lambda \in \Lambda} p(y | \lambda, \gamma)\, p(\lambda)\, p(\gamma)$$

and $p(y | \lambda, \gamma) = p(y_\lambda | \lambda, \gamma) J(\lambda)$, where $J(\lambda)$ is
the Jacobian of the transformation $y \to y_\lambda$ and
is equal to $\prod_{i=1}^{n} |\lambda| y_i^{\lambda - 1}$ if $\lambda \neq 0$ and $\prod_{i=1}^{n} 1/y_i$ if
$\lambda = 0$. From (2.2) and (2.3) we obtain

$$p(y | \lambda, \gamma) \propto (1 + c)^{-q_\gamma/2} S(\lambda, \gamma)^{-n/2} J(\lambda) \qquad (4.2)$$

where

$$S(\lambda, \gamma) = y_\lambda' y_\lambda - \frac{c}{1+c}\, y_\lambda' X_\gamma (X_\gamma' X_\gamma)^{-1} X_\gamma' y_\lambda \qquad (4.3)$$

The prior for $\gamma$ is the same as in Section 2 and
in our applications we take a uniform prior on
$\lambda \in \Lambda$.
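The role of the Jacobian when comparing transformations can be sketched as follows, for a fixed set of columns X_γ. The sketch assumes the power form y^λ (log y at λ = 0) and a marginal likelihood of the g-prior form (1 + c)^(-q/2) S(λ, γ)^(-n/2) J(λ); the function names are hypothetical, and with a uniform prior on λ the prior term is a constant and is omitted.

```python
import numpy as np

LAMBDAS = (-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0)

def boxcox(y, lam):
    """Power transformation; log at lam = 0."""
    return np.log(y) if lam == 0 else y ** lam

def log_jacobian(y, lam):
    """log J(lam) for the transformation y -> y_lam."""
    if lam == 0:
        return -np.sum(np.log(y))
    return len(y) * np.log(abs(lam)) + (lam - 1.0) * np.sum(np.log(y))

def log_marginal(y, Xg, lam, c=100.0):
    """log p(y | lam, gamma) up to a constant, for the columns Xg."""
    z = boxcox(y, lam)
    n, q = Xg.shape
    b = Xg.T @ z
    S = z @ z - c / (1.0 + c) * (b @ np.linalg.solve(Xg.T @ Xg, b))
    return -0.5 * q * np.log1p(c) - 0.5 * n * np.log(S) + log_jacobian(y, lam)

def best_lambda(y, Xg, lambdas=LAMBDAS, c=100.0):
    """Value of lam in the grid with the highest marginal likelihood."""
    scores = [log_marginal(y, Xg, lam, c=c) for lam in lambdas]
    return lambdas[int(np.argmax(scores))]
```

Without the Jacobian term the scores for different λ would not be comparable, since each transformation puts the response on a different scale. Summing `exp(log_marginal)` over the grid (after subtracting the maximum for numerical stability) gives the quantity needed to integrate λ out.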

We found it necessary to integrate $\lambda$ out when
generating $\gamma$. The Gibbs sampler generating
$\gamma_i | y, \gamma_{j \neq i}, \lambda$, $i = 1, \ldots, r$ and $\lambda | y, \gamma$ tended to
get stuck, because of the high correlation between
the $\lambda$ and $\gamma$ iterates. If $\lambda$ takes on only a
small number of values then the variable selection
algorithm can be very fast as the terms
$y_\lambda' y_\lambda$ and $y_\lambda' X$ can all be precalculated. For
each of the models generated by the Gibbs sampler
it is straightforward to calculate the density
$p(\lambda, \gamma | y) \propto p(y_\lambda | \gamma, \lambda) J(\lambda) p(\gamma) p(\lambda)$, up to
a constant independent of $\lambda$ and $\gamma$, which enables
us to keep track of the values of $\lambda$ and $\gamma$
maximising the posterior density.

To illustrate the performance of our approach
to simultaneously determining $\gamma$ and $\lambda$ we generated
100 observations from (3.3)-(3.5) as in
Section 3. For the data generated from (3.3) we
transformed  y,-  (y;  -f- 1)“2 ,  for  the  data  gen¬ 

erated  from  (3.4)  we  transformed  y,-  exp(y,  -l- 
2.5)  and  for  the  data  generated  from  (3.5)  we 
transformed  y,-  (y,-  -|-  7)~^.  Figure  5  plots 

the  transformed  data  in  the  left  panels  and  the 
original  data,  together  with  the  curve  estimate 
and  the  true  curve,  in  the  right  hand  panels  for 
each  of  the  three  functions.  It  is  clear,  that  at 
least  for  these  realisations,  the  nonparametric 
approach  with  variable  selection  performs  very 
well.  Further  simulations  indicated  that  this 
combined  approach  is  highly  effective. 

5  Additive  semiparametric 
regression 

Because  regression  splines  are  linear  models  it 
is  possible  to  employ  them  in  an  additive  model 
context  by  constructing  a  single  design  matrix 
made  up  of  columns  of  the  individual  design 
matrices  of  the  type  outlined  in  the  previous 
section.  Model  selection  can  then  be  performed 
simultaneously  on  the  knots  (and  other  polyno¬ 
mial  terms)  associated  with  each  independent 
variable  modelled  by  a  regression  spline,  by  se¬ 
lecting  from  the  columns  of  this  new  design  ma¬ 
trix. 

The next example illustrates the performance
of our approach to variable selection and data
transformation on a four component additive
regression model. Two hundred observations
were generated from

$$y_i = \exp\{f_1(x_{1i}) + f_2(x_{2i}) + f_3(x_{3i}) + f_4(x_{4i}) + e_i\}.$$

The errors $e_i$ are independent $N(0, 0.5^2)$,
$f_1(z) = \sin(2\pi z)$, $f_2(z) = -1.5z$, $f_3(z) = \cos(6\pi z)$
and $f_4$ is null. The independent variables
$x_{1i}, \ldots, x_{4i}$, $i = 1, \ldots, n$, are each generated
from a uniform distribution. Figures 6(a)-6(d)
plot $y_i$ against each of the independent regressors
and show that it is difficult to determine
the functional forms $f_1, \ldots, f_4$ from these
plots. The additive model

$$y_{i,\lambda} = f_1(x_{1i}) + f_2(x_{2i}) + f_3(x_{3i}) + f_4(x_{4i}) + e_i$$

was fitted to the data using the Bayesian approach
explained above, with $\lambda$ taking the 7
values given in Section 4. Each function $f_j$
was approximated by a regression spline with
13 knots, one every 15 observations. We ran
the Gibbs sampler with the initial value of
$\gamma = (1, 0, 1, \ldots, 0, 1)$, a warmup period of 300 iterations
and a sampling period of 3000 iterations.
The posterior mode of $\lambda$ and $\gamma$ produced a log
transformation; the estimate of $f_1$ included linear
and quadratic terms together with two extra
knots, the estimate of $f_2$ was linear, the estimate
of $f_3$ required the squared and cubic terms
plus six extra knots and the estimate of $f_4$ was
null. This means that out of $r = 65$ potential
regressors, $q = 14$ were selected. The $R^2$
for this model was 0.867. Figure 6(e) plots the
transformed data (scatter plot), the true value
of $f_1$ (solid line) and its estimate (dashed line)
against $x_{1i}$. Figures 6(f), 6(g) and 6(h) are similar
plots for $f_2$ to $f_4$, with $f_4$ null. These plots
show that for this simulated data set our approach
selects the correct data transformation
and provides good estimates of the components.
In particular, the null component is omitted
from the model.
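
For readers who wish to reproduce a data set of this kind, the following sketch generates observations from the four component model. It is an illustration only: the Uniform(0, 1) range of the regressors, the seed and the component functions as read from the text above (in particular the slope of $f_2$) are assumptions.

```python
import math
import random

def f1(z): return math.sin(2 * math.pi * z)
def f2(z): return -1.5 * z                      # slope as read from the text
def f3(z): return math.cos(6 * math.pi * z)
def f4(z): return 0.0                           # the null component

def simulate(n=200, sigma=0.5, seed=1):
    """Return a list of (x, y) pairs with y = exp(f1 + f2 + f3 + f4 + e)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = [rng.random() for _ in range(4)]    # assumed Uniform(0, 1) regressors
        eta = f1(x[0]) + f2(x[1]) + f3(x[2]) + f4(x[3]) + rng.gauss(0.0, sigma)
        rows.append((x, math.exp(eta)))
    return rows

data = simulate()
# On the log scale the model is additive again, which is why a log
# transformation is the "correct" one for data generated this way.
log_y = [math.log(y) for _, y in data]
```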


6  Implementing  the  Gibbs 
sampler 


We outline how to efficiently implement the
Gibbs sampler described in Section 2 and extend
the result to the data transformation case
discussed in Section 4. Before running the sampler
the terms $y'y$, $X'y$ and $X'X$ are computed.
To generate $\gamma_i$, we note that $p(\gamma_i \mid y, \gamma_{j \neq i})$ is binomial
with $p(\gamma_i = 1 \mid y, \gamma_{j \neq i}) = 1/(1 + h)$, where

$$h = \frac{1 - \pi_i}{\pi_i}\,(c + 1)^{1/2}\left(\frac{S(\gamma^1)}{S(\gamma^0)}\right)^{n/2},$$

$\gamma^1 = (\gamma_1, \ldots, \gamma_{i-1}, \gamma_i = 1, \gamma_{i+1}, \ldots, \gamma_r)$ and
$\gamma^0 = (\gamma_1, \ldots, \gamma_{i-1}, \gamma_i = 0, \gamma_{i+1}, \ldots, \gamma_r)$.
Suppose that $\gamma = \gamma^0$ before $\gamma_i$ is generated.
Then $S(\gamma^0)$ is known and it is necessary
to obtain $S(\gamma^1)$. The main computational
difficulty in obtaining $S(\gamma^1)$ is evaluating
$y'X_{\gamma^1}(X_{\gamma^1}'X_{\gamma^1})^{-1}X_{\gamma^1}'y$. This is done by factoring
$X_{\gamma^1}'X_{\gamma^1}$ as $L_1L_1'$, where $L_1$ is lower
triangular, using the Cholesky decomposition
and then computing $L_1^{-1}X_{\gamma^1}'y$. We note that
$X_{\gamma^1}'X_{\gamma^1}$ and $X_{\gamma^0}'X_{\gamma^0}$ differ by only one row and
column so that $L_1$ can be readily obtained from
$L_0$, where $L_0L_0'$ is the Cholesky decomposition
of $X_{\gamma^0}'X_{\gamma^0}$; see Dongarra, Moler, Bunch and
Stewart (1979, Ch. 10). If $\gamma = \gamma^1$ before $\gamma_i$
is generated, then $L_0$ can similarly be obtained
from $L_1$. From Dongarra et al. (1979), generating
$\gamma_i$ requires $O(q^2)$ operations. Hence generating
$\gamma$ requires $O(rq^2)$ operations, where $q$ is the
typical number of regressors required. We refer
the reader to Dongarra et al. (1979) for a discussion
of fast and stable methods for updating
a Cholesky decomposition.
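
The update from $L_0$ to $L_1$ when a column enters the model can be sketched as below. This is a simplified illustration under stated assumptions, not the LINPACK routine: the new column is placed last (the ordering of columns does not change the quadratic form), the helper names and the small data set are hypothetical, and the reverse update for removing a column is not shown.

```python
import math

def cholesky(a):
    """Lower-triangular L with L L' = a, for a symmetric positive definite matrix."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def forward_solve(L, b):
    """Solve L w = b for lower-triangular L in O(q^2) operations."""
    w = []
    for i, row in enumerate(L):
        w.append((b[i] - sum(row[k] * w[k] for k in range(i))) / row[i])
    return w

def chol_append(L0, b, d):
    """Factor of [[A, b], [b', d]] given the factor L0 of A; costs O(q^2)."""
    w = forward_solve(L0, b)
    last = math.sqrt(d - sum(v * v for v in w))
    return [row + [0.0] for row in L0] + [w + [last]]

def quad_form(L, xty):
    """y'X (X'X)^{-1} X'y computed as the squared length of L^{-1} X'y."""
    return sum(v * v for v in forward_solve(L, xty))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def gram(cols):
    return [[dot(u, v) for v in cols] for u in cols]

# Hypothetical data: three candidate columns and a response.
cols = [[1.0, 1.0, 1.0, 1.0], [0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 4.0, 9.0]]
y = [1.0, 2.0, 2.0, 5.0]

L0 = cholesky(gram(cols[:2]))               # current model: first two columns
b = [dot(c, cols[2]) for c in cols[:2]]     # cross-products with the new column
L1 = chol_append(L0, b, dot(cols[2], cols[2]))
L1_direct = cholesky(gram(cols))            # O(q^3) refactorisation, for comparison
fit1 = quad_form(L1, [dot(c, y) for c in cols])
```

Only cross-products appear in the update, which is why $y'y$, $X'y$ and $X'X$ can be computed once before the sampler starts.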

When the dependent variable is transformed as well,
we first obtain the terms $y_\lambda'y_\lambda$, $X'y_\lambda$ and $X'X$
for each value of $\lambda \in \Lambda$. Fast calculation of
$S(\lambda, \gamma)$ is done as above.

7  Discussion  of  related  work 

Differences in approaches to Bayesian model
selection revolve primarily around the specification
of the conditional prior $\beta \mid \gamma, \sigma^2$ because
it introduces the indicator variables into the
model. Mitchell and Beauchamp (1988, p. 1024)
use a uniform prior, letting
$\beta_i \mid \gamma, \sigma^2 \sim \mathrm{Uniform}(-a_i, a_i)$, with $a_i$ large for each $i$. The
decision of how large to choose the values of $a_i$
is left to the user.

George and McCulloch (1993) use the nonconjugate
normal prior $\beta_i \mid \gamma, \sigma^2 \sim N(0, \tau_i^2)$ if
$\gamma_i = 0$ and $\beta_i \mid \gamma, \sigma^2 \sim N(0, c_i^2\tau_i^2)$ if $\gamma_i = 1$. The
constants $\tau_i$ and $c_i$ are chosen so that $\tau_i$ is small
and $c_i$ is large. George and McCulloch (1993)
make some suggestions on suitable choices for
$c_i$ and $\tau_i$ and use the following Gibbs sampler
to generate models of high probability: Generate
from (a) $p(\beta \mid y, \sigma^2, \gamma)$; (b) $p(\sigma^2 \mid y, \beta, \gamma)$;
(c) $p(\gamma_i \mid y, \beta, \sigma^2, \gamma_{j \neq i})$ for $i = 1, \ldots, n$. We have
found this sampler difficult to implement for
our problems because of the high correlation between
$\beta$ and $\gamma$. If $\tau_i$ is chosen too small then
the sampler is nearly degenerate and tends to
get stuck. If $\tau_i$ is chosen too large, significant
terms are omitted and high local bias is experienced.
We note that this sampler requires $O(r^3)$
operations to generate $\beta$, which can be considerably
slower than our algorithm if q is much
smaller than p.

George and McCulloch (1994) consider the
conjugate prior $\beta_i \mid \gamma, \sigma^2 \sim N(0, \sigma^2\tau_i^2)$ if $\gamma_i = 0$
and $\beta_i \mid \gamma, \sigma^2 \sim N(0, \sigma^2c_i^2\tau_i^2)$ if $\gamma_i = 1$ and obtain
$p(\gamma \mid y)$ by integrating out $\beta$ and $\sigma^2$. Given
$c_i$ and $\tau_i$ they use the Gibbs sampler in Section
2 to generate the $\gamma_i$. The computations required
are carried out efficiently using the fast
Cholesky updates in Dongarra et al. (1979). Because
all variables remain in the regression for
each value of $\gamma$, the fast Cholesky implementation
in George and McCulloch (1994) requires
$O(r^3)$ operations to generate $\gamma$, which can be
substantially slower than our approach which
requires $O(rq^2)$ operations.

Figure 2: (a)-(c) Plots of transformed data; (d)-(f) plots of original data (scatter plot), true curves
(solid) and estimated curves (dotted).


Figure 3: Parts (a)-(d) plot the transformed data against the four independent variables. Part (e)
plots the transformed data (scatter plot), the true $f_1$ and its estimate against $x_{1i}$. Parts (f)-(h) are
similar plots for $f_2$ to $f_4$.

Abstract

Space filling experimental designs evenly distribute design
points throughout a design space. These designs are useful
for applications where optimums are thought to exist in
distinct areas. A space filling design was carried out to
determine a best set of storage conditions for a particular
protein construct. The design consisted of 96 points and
tested the effect of eight experimental variables on protein
activity. The analysis of the results was performed using
linear regression, recursive modeling, and picking the set
of conditions from the experimental results which produced
the highest result. None of the analysis methods was found
to be completely satisfactory for the analysis of these data.
This experiment, while operationally successful,
demonstrates the need for better algorithms and analysis
methods for generating and assessing space filling
experimental designs.

Introduction 

Experimental  designs  are  used  in  industrial  applications  to 
determine  optimal  process  conditions  by  varying  many 
factors  simultaneously.  It  is  difficult  to  apply  classical 
experimental  designs  when  the  experimental  space  is 
irregular  in  shape,  certain  experimental  combinations  are 
not  physically  possible,  or  where  many  of  the  experimental 
factors are categorical. Computer generated exact D-optimal
designs are often used in these situations (Snee,
1985). D-optimal designs are based upon a model of the
process, usually linear,

$$Y = X\beta + \varepsilon,$$

where $Y$ is a column vector of responses, $X$ is a design
matrix, $\beta$ is a column vector of coefficients to be estimated
and $\varepsilon$ is a column vector of errors from the linear model.
Exact D-optimal algorithms are computer intensive; they
start with a random design and then delete points from the
current design and add specific points from the
experimental space to maximize the determinant of the $X'X$
matrix. The selected points tend to be on the extremes of
the experimental space and it is assumed that the linear
model can be used to interpolate conditions in the space.
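
The add-and-delete idea behind exchange algorithms can be sketched as follows. This is a deliberately simplified illustration, not the algorithm coded in SAS Proc Optex: the candidate grid, the main-effects model, the design size and the stopping rule are all assumptions made for the example.

```python
import itertools
import random

def det(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    n, d = len(a), 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[p][c]) < 1e-12:
            return 0.0
        if p != c:
            a[c], a[p] = a[p], a[c]
            d = -d
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return d

def model_row(pt):
    """Model terms for one design point: intercept plus main effects."""
    return [1.0] + list(pt)

def d_criterion(points):
    """det(X'X) for the design made up of the given points."""
    rows = [model_row(p) for p in points]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    return det(xtx)

def exchange(candidates, n_runs, seed=0):
    """Start from a random design, then add and delete points while det(X'X) improves."""
    rng = random.Random(seed)
    design = rng.sample(candidates, n_runs)
    best = d_criterion(design)
    while True:
        # Add the candidate that gives the largest det(X'X) ...
        add = max(candidates, key=lambda c: d_criterion(design + [c]))
        trial = design + [add]
        # ... then delete the point whose removal hurts the criterion least.
        drop = max(range(len(trial)),
                   key=lambda i: d_criterion(trial[:i] + trial[i + 1:]))
        trial = trial[:drop] + trial[drop + 1:]
        value = d_criterion(trial)
        if value <= best * (1 + 1e-9):
            return design, best
        design, best = trial, value

levels = [-1.0, -0.5, 0.0, 0.5, 1.0]
candidates = list(itertools.product(levels, repeat=2))
design, value = exchange(candidates, n_runs=6)
print(sorted(design), value)
```

With a main-effects model the selected points are drawn toward the corners of the grid, which illustrates why additional model terms are needed before a D-optimal criterion will place points in the interior.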

In  some  situations,  the  experimenter  is  not  willing  to 
assume  a  model  beyond  saying  that  points  near  one  another 
are  going  to  respond  similarly.  In  such  a  situation  it  is 
natural  to  place  points  throughout  the  space,  hence  the 
term  space  filling  designs.  Space  filling  designs  have  no 
underlying  model  and  try  to  best  fill  the  n-dimensional 
space with a finite number of points. Some algorithms for
space filling experimental designs are becoming available
(Kennard and Stone, 1969; SAS/QC, 1993); however,
little attention has been given to the subsequent analysis of
these designs. If there are local regions of high activity,
then  use  of  linear  models  is  likely  to  be  unsatisfactory.  In 
many  industrial  settings,  experimenters  have  often  found 
that  only  a  few  of  the  hypothesized  important  factors  turn 
out  to  have  much  of  an  effect,  a  situation  called  effect 
sparsity.  If  effect  sparsity  holds  in  situations  where  space 
filling  designs  are  used,  then  the  analysis  method  should 
find  which  factors  are  important  and  what  regions  in  these 
subspaces  have  good  results.  There  do  not  seem  to  be 
standard  methods  for  finding  compact  regions  of  similar 
response  in  a  high  dimensional  space. 

We  used  a  space  filling  experimental  design  to  determine  a 
set  of  storage  conditions  for  a  purified  protein  construct.  In 
this  design  we  were  concerned  with  near  neighbor 
predictability,  dealing  with  an  irregular  sample  space  and 
estimating the pure error inherent in the assay. Various
methods of analysis were used to examine the resulting
dataset, including linear regression, recursive modeling
and picking the maximum value from the results. While
the experiment was operationally successful (a good set of
conditions was found), many problems with the
construction of the design and analysis were raised.

Description  of  Experiment 

The purification and characterization of proteins is an
important process in the first stages of the drug discovery
process. Biotechnology is used to identify important regions
of DNA and these DNA segments are spliced into a vector
for expression. The resulting protein constructs are small
purified segments of protein which include the active site
and have the same activity as the complete parent protein.
The use of purified protein constructs is becoming
instrumental for determining the functionality of
biochemical processes, investigating the effects of novel
pharmaceuticals and solving tertiary structures of large
proteins, Cunningham and Wells (1991).

In  order  to  stabilize  the  constructs  and  maintain  biological 
activity,  purified  protein  constructs  are  stored  in  a  buffered 
solution  containing  other  chemical  additives  such  as 
detergents,  reducing  agents,  and  salts.  The  correct 
combination  of  chemicals  that  make  up  these  solutions  is 
usually  found  by  performing  a  series  of  experiments  where 
the  different  solution  additives  are  varied  one  at  a  time. 
This  method  requires  a  large  number  of  experiments,  does 
not  have  the  ability  to  determine  how  the  different 
experimental  factors  interact,  and  does  not  provide  an 
estimate  of  experimental  variation. 

The  factors  and  their  ranges  thought  likely  to  contain  the 
optimum  stability  conditions  for  a  protein  construct  were 
selected  by  a  team  of  protein  chemists.  The  final  list 
contained  a  total  of  eight  storage  variables.  These  variables 
and their settings were chosen based on previous
experience and recent experimentation. For continuous
variables a practical experimental range was determined
from which three or more settings were chosen. The
spacing of the settings was selected so that the gaps
between settings were small enough that a narrow optimum
would not be missed. In cases where only three settings
were to be tested the low and high values were set inside
the extreme possible conditions. Conditions were selected
so that they would not interfere with subsequent
experimental processes such as protein crystallography or
biological assay.

The selected conditions are given in Table 1; the total
number of combinations of all variables and levels
produced a candidate set of 18,144 possible experimental
conditions. Next, buffer/pH combinations which were not
biologically or chemically practical were excluded from the
candidate set. Examples are the MES buffers at pH higher
than 6.5, the TRIS buffers at pH higher than 8.0 and the
HEPES buffers where pH was below 6.5 or higher than 8.0.
The exclusion of these combinations reduced the candidate
set to 9720 possible experimental conditions.


Table 1

Experimental factors and ranges considered
important for the optimal storage condition of
purified protein constructs

Factor                            Settings
Buffers                           Tris, PO4, Mes, Hepes
pH                                6-9
Protein Concentration             100-1000 ug/ml
Reducing Agents                   BME, TCEP, DTT
Detergents                        Tween, Ethylene glycol, NP-40, Octylglucoside
Temperature                       -80, -20, 4 °C
NaCl                              100-1000 mM
MgCl                              Yes/No
Number of Possible Experiments    18,144
Number of Experiments Performed   96


Design  Generation  and  Experimental  Results 
The  D-optimal  exchange  algorithm  of  Mitchell  and  Miller 
(1970)  as  coded  in  Proc  Optex  of  SAS®  was  used  to 
choose  design  conditions  from  the  candidate  set.  Main 
effects,  quadratic  effects  and  two  way  interactions  were 
included  in  the  model  to  force  the  algorithm  to  fill  the 
experimental  space.  Algorithms  such  as  those  available  in 
version  6.07  of  SAS  Proc  Optex  fill  a  multidimensional 
design space more efficiently. However, at the time this
experiment was performed, a satisfactory set of design
points using these algorithms was unobtainable because of
the large number of class variables in the design. A
reference set of conditions was forced into the design and
run in triplicate both to give an experimental "gold
standard" and to allow an estimate of pure error. The sample
size  for  the  experiment  was  set  at  96,  93  separate 
conditions  along  with  3  replications  of  a  "gold  standard" 
set  of  conditions. 

The protein activity of each of the 96 samples was
determined once a week for a total of four weeks. The
activity recorded after the fourth week was used for the
analysis.

Design Generation Results

Two dimensional views of the numerical variables showed
that in most cases there were representative points for each
factor (Figure 1). However, in some cases the interior areas
of the design space were not well represented. Currently
there is no criterion for assessing how well points fill a
space.

Figure 1. Quantitative Factors (scatter plot panels for pH, NaCl, Prot_Con, MgCl and Temp).

Analysis:  Linear  Regression 

Analysis  of  the  dataset  using  linear  regression  produced  a 
predictive  model  with  a  large  number  of  statistically 
significant  predictor  variables;  results  of  this  analysis  are 
given in Table 2. Numerous two way interaction and
quadratic terms were found, making simple interpretation of
specific experimental factors difficult. Other causes of
concern were the possibility of overfitting and the apparent
violation of effect sparsity. Table 3 gives the predicted
responses and cross validation results for the best predicted
conditions. These results demonstrate that while in some
cases the model was able to predict within assay error, in
other cases the predicted response was erroneous, thus
demonstrating the relative importance of using local points
to predict in the sparse design space.


Table 2

Linear Regression Results

Source                   df    SSq     F ratio   p Value
C. Total                 95    38.46
Buffer                    3     1.47     6.12    0.0013
Buffer X NaCl             3     0.82     3.43    0.0245
Buffer X Red_agent        6     1.45     3.01    0.0142
Buffer X Detergent        9     2.49     3.45    0.0025
Buffer X Temp             3     1.58     6.59    0.0008
NaCl                      1     0.23     2.82    0.0996
NaCl X NaCl               1     0.61     7.49    0.0087
ProtConc                  1     2.84    35.49    0.0001
ProtConc X ProtConc       1     0.85    10.61    0.0021
ProtConc X Detergent      3     1.65     6.86    0.0006
Red_agent                 2     0.68     4.28    0.0196
Red_agent X Detergent     6     1.02     2.12    0.0689
Red_agent X Temp          2     0.82     5.10    0.0099
Detergent                 3     7.85    32.64    0.0001
Detergent X Temp          3     1.93     8.03    0.0002
Temp                      1     0.84    10.59    0.0021
Residual                 47     3.76
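
As a reading aid for Table 2, the sketch below recomputes an F ratio and the overall $R^2$ from the tabulated sums of squares. It assumes the usual definitions (each term's mean square divided by the residual mean square, and $R^2 = 1 - SS_{residual}/SS_{total}$); small differences from the printed F ratios are to be expected because the sums of squares are rounded to two decimals.

```python
# Values copied from Table 2.
ss_total, df_total = 38.46, 95
ss_resid, df_resid = 3.76, 47

ms_resid = ss_resid / df_resid          # residual mean square

def f_ratio(ss, df):
    """F ratio of a model term: its mean square over the residual mean square."""
    return (ss / df) / ms_resid

r_squared = 1.0 - ss_resid / ss_total   # proportion of variation explained
df_model = df_total - df_resid          # degrees of freedom used by the model

print(f_ratio(1.47, 3))                 # Buffer
print(f_ratio(7.85, 3))                 # Detergent
print(r_squared, df_model)
```

The model uses 48 of the 95 available degrees of freedom, which is the basis for the concern about overfitting raised above.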

Analysis:  Recursive  Modeling 

Recursive modeling (FIRM, Hawkins) is based on
partitioning the data into two or more groups according to
the range of values of one predictor. Once an initial
partition is obtained, each one of the partitioned groups is
divided into two or more groups based upon one of the
remaining predictors. The partitioning stops when the
group size becomes too small to be partitioned or the group
becomes homogeneous. A recursive model of our data
showed that the data first split with regard to which type of
detergent was used. These subgroups were then split with
regard to whichever variable was important (Figure 2).
The optimum predicted result from the FIRM analysis had
a mean predicted activity of 1.55 ± 0.31. This group of
observations had higher protein concentrations, was stored
at high temperatures and contained n-octylglucoside.

While recursive modeling is considered useful for finding
complicated interactions in large data sets, it does have
some disadvantages. FIRM creates trees by forward
selection. The analysis stops when there is a non-significant
split, thus possibly hiding significant
interactions below non-significant main effects.

Figure 2. FIRM analysis. Each box is numbered and shows the
number of observations, mean and
standard deviation.

Analysis: Pick the Winner

Several  of  the  experimental  values  obtained  from  the 
design  points  were  superior  to  those  obtained  in  the 
laboratory  prior  to  this  experiment  (Table  3).  The 
maximum  result  demonstrated  a  higher  activity  when 
compared  to  the  gold  standard  value.  This  area  of  the 
design  space  should  be  further  investigated. 


Table 3

Linear Regression Model of
Observed and Predicted Values of
Standards and Best Experimental
Results

Run     Observed   Linear Model Prediction   Cross Validation Prediction
37      2.11       2.05                      1.92
76      2.06       2.03                      2.01
90      1.94       1.90                      1.84
20      1.88       1.56                      1.01
38      1.81       1.17 *                    0.89 *
66      1.79       1.83                      1.91
27      1.79       1.78                      1.77
79      1.73       1.70                      1.67
29      1.72       1.64 *                    1.44 *
68      1.69       1.35 *                    0.84 *
std1    1.44       1.39                      1.37
std2    1.31       1.39                      1.42
std3    1.26       1.39                      1.46

Discussion 

Numerous methods are available for constructing space
filling designs. Cluster analysis has been previously used
for this purpose, Zemroch (1986). The addition of higher
order terms in D-optimal methods can also be used. For
example, the insertion of a quadratic term into the model
will force three levels of the factor into the design.
Indicator variables for categorical variables will force each
category into the design. Thus a D-optimal strategy was
used to construct this space filling design; now that more
direct algorithms are available, they should be used. It is an
open question as to which algorithms are "best" and indeed
how to even measure best.

The  sample  size  of  96  observations  was  chosen  as  the 
maximum  amount  of  protein  material  and  assay  resources 
that  were  available.  There  was  no  attempt  to  reason  how 
many  samples  were  necessary  to  fill  the  sample  space 
adequately.  Such  reasoning  would  depend  upon  the  degree 
subspace  considered  important,  effect  sparsity,  and  the  size 
of  the  gaps  expected  to  be  tolerable.  Univariate  gaps  were 
considered  in  selecting  the  candidate  space,  but  higher 
dimension  gaps  were  not  considered.  The  many  categorical 
variables  in  this  experiment  appear  to  exacerbate  sample 
size  determination.  The  logic  for  selection  of  sample  size 
for  space  filling  designs  remains  an  interesting  problem. 

The use of space filling experimental designs is appealing
for use in industrial applications where a localized
maximum or "spiked" response is expected. We were able
to construct a space filling design for determining a set of
storage conditions for a purified protein construct. By
demonstrating superior activity to previously used
conditions at several areas of the design space, we were
operationally successful; however, we uncovered a number
of problems. At the present time the theory and software
for the construction of space filling designs are not well
developed. Additionally, traditional analysis procedures
may not adapt well to these designs. Some of the problems
that will have to be overcome include effect sparsity,
overfitting or multiplicity, design space predictability,
sample size determination and a criterion for comparing
designs.

Abstract

In this paper we evaluate the usefulness of the following
nonparametric regression methods for the analysis of a
space-filling design: Gaussian stochastic process models,
thin-plate splines, single hidden layer neural networks,
generalized additive models and multiple adaptive regression
splines. The space-filling design on which the
evaluation is based was used to optimize the buffer for
a DNA amplification method. The methods were evaluated
based on how well they fit the data and on the
reasonableness of the resulting multidimensional structures.
The methods of Gaussian stochastic processes and
thin-plate splines seemed most useful for this data set.

1  Introduction 

Space-filling designs have been primarily used for computer
experiments [1] [2]. However, we believe that these
designs should also be useful in physical experiments in
the pharmaceutical and biotechnology industries. For
example, the use of a space-filling design has been reported
by Menius and Young [3] where it was used to
discover storage buffer conditions that preserved the activity
of a protein construct, and Van Cleve [4] carried
out the space-filling design which we analyze in this paper
in order to optimize buffer conditions for a DNA
amplification method.

One of the assumptions motivating the use of a space-filling
design is that the response surface is likely to be
highly nonlinear. Thus, a low order polynomial model,
as traditionally used for a response surface design, will
not be sufficiently flexible to capture the relevant structure
of the underlying surface. Consequently, flexibility
in the regression model is critical. In this paper we explore
the use of several methods which can loosely be
classified as nonparametric regression surfaces because
of the highly flexible nature of their regression models.
These methods include Gaussian stochastic processes,
thin-plate splines, single hidden layer neural nets, generalized
additive models, and multiple adaptive regression
surfaces.


The goal of the analysis of a space-filling design is to
fit a model

$$Y = f(x) + e$$

where $Y$ is a response variable and $x = (x_1, x_2, \ldots, x_d)$
is the set of experimental or predictor variables. In the
context of process optimization, it is of interest to find
the best settings of the subset of important variables and
to predict the response value at these optimum settings.
Scientists and engineers may also gain insight into the
underlying mechanisms by examining the structure of
the surface generated by $f$.

Two fundamental obstacles to this process are that
the form of the function $f$ is generally unknown and that
$d$ is usually large. Because the form of $f$ is unknown,
an approximating function of some sort must be used.
Consider the case in which $f$ is approximated by an
$m$th order polynomial; then the approximation will have

$$\binom{m + d}{m}$$

terms, growing like $m^d$. The exponential increase in
terms as a function of the dimension is known as the
curse of dimensionality and this difficulty affects all approaches
to the problem.
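
The growth in the number of polynomial terms is easy to tabulate. The short sketch below evaluates the binomial coefficient for a few illustrative values of $m$ and $d$; the particular values are arbitrary choices, not taken from the experiment.

```python
from math import comb

def n_terms(m, d):
    """Number of terms in a full polynomial of order m in d variables."""
    return comb(m + d, m)

for d in (2, 4, 8, 16):
    print(d, [n_terms(m, d) for m in (1, 2, 3, 4)])
```

For example, a full quadratic in four variables has `n_terms(2, 4)` terms, already a sizeable fraction of the runs available in a typical physical experiment.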

In order to be useful for the analysis of a space-filling
design, a nonparametric regression model must be flexible
enough to capture the multidimensional structure of
the surface. In this paper, we evaluate the nonparametric
regression models with this criterion in mind. The
example used for the evaluation is described in the next
section. In Section 3 we provide a brief description of
each of the regression models we evaluated. Section 4
presents the results of the model fitting and makes comparisons
among the different methods. Finally, Section
5 provides some brief conclusions.

2 Strand Displacement Amplification

Strand displacement amplification (SDA) is a method
for DNA amplification that was invented at the Becton
Dickinson Research Center [5] and [6]. SDA is an isothermal
amplification method that utilizes the ability of an
enzyme to nick an unmodified strand of a hemiphosphorothioate
form of DNA at its recognition site. DNA polymerase
extends the nicked site and displaces the downstream
DNA strand. Exponential amplification results
from coupling sense and antisense reactions in which
strands displaced from a sense reaction serve as a target
for an antisense reaction and vice versa.

A space-filling design was conducted to study the effects
of buffer composition on strand displacement amplification
by Van Cleve [4]. Four buffer components were
systematically varied in a space-filling design; namely,
KCl, KPO4, MgCl2 and dNTP. The two components
KCl and KPO4 are thought to affect amplification primarily
via their contribution to the ionic strength of the
buffer. Ionic strength affects DNA hybridization and
enzyme activity. Each enzyme in the system is likely to
have its own optimal salt concentration so there may well
be several local optima. The variable dNTP (deoxyribonucleotides)
represents the micromolar concentration
of each of the four basic building blocks of DNA needed
for extending the nicked site. dNTP binds Mg$^{2+}$ on a
one-to-one basis so MgCl2 must be present in at least
equal molar concentration as the total dNTP concentration
in order for extension to take place. Since Mg$^{2+}$ is
also a cofactor for the restriction enzyme, it needs to be
in excess of the total dNTP concentration.

The design was constructed in three stages. First, a
2000 run Latin hypercube design was generated. Second,
each factor was rounded to 20 levels. These 2000 runs
were the candidate set of design points. Third, the best
settings from previous experiments were specified as a
fixed point and 55 additional design points were selected
based on an approximation to the maximin criterion of
Johnson, Moore, and Ylvisaker [8]. The software ALEX
(ALgorithms for Efficient eXperiments, Welch [9]) was
used to generate the design. Some of the runs were replicated
and a total of 89 response values on the 56 different
buffers were available for analysis.
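
The three stages can be sketched as follows. This is not the ALEX algorithm: the greedy selection rule is only one simple approximation to a maximin criterion, the factors are scaled to the unit interval, and the fixed point used here is a hypothetical stand-in for the best settings from previous experiments.

```python
import random

def latin_hypercube(n, d, rng):
    """n points in [0, 1)^d with exactly one point per stratum in every dimension."""
    cols = []
    for _ in range(d):
        perm = list(range(n))
        rng.shuffle(perm)
        cols.append([(p + rng.random()) / n for p in perm])
    return [tuple(c[i] for c in cols) for i in range(n)]

def round_to_levels(pt, levels):
    """Snap each coordinate to the nearest of `levels` equally spaced values on [0, 1]."""
    return tuple(round(v * (levels - 1)) / (levels - 1) for v in pt)

def sq_dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def greedy_maximin(candidates, fixed, n_extra):
    """Repeatedly add the candidate farthest from its nearest design point."""
    design = [fixed]
    pool = [c for c in dict.fromkeys(candidates) if c != fixed]  # drop duplicates
    nearest = {c: sq_dist(c, fixed) for c in pool}
    for _ in range(n_extra):
        pick = max(pool, key=nearest.get)
        design.append(pick)
        pool.remove(pick)
        for c in pool:
            nearest[c] = min(nearest[c], sq_dist(c, pick))
    return design

rng = random.Random(0)
candidates = [round_to_levels(p, 20) for p in latin_hypercube(2000, 4, rng)]
fixed = round_to_levels((0.5, 0.5, 0.5, 0.5), 20)   # hypothetical reference buffer
design = greedy_maximin(candidates, fixed, 55)
print(len(design))
```

Rescaling each coordinate from the unit interval to the actual concentration range of the corresponding buffer component would give the settings to be run.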

The response value is the counted intensity of an appropriate
band on an electrophoresis gel evaluated on a
PhosphorImager (Molecular Dynamics Model 425E). Because
not all of the experimental buffers could be evaluated
on one gel, two replicates of a control condition
were run on each gel. The control condition represented
the best buffer from previous experiments. (The corresponding
settings were included as the fixed point when
generating the design as described above.) The values
were normalized for each gel as follows:

$$y_{ij}^{\text{normalized}} = \frac{c}{c_j}\, y_{ij}$$

where $c$ is the average of all of the control runs, and $c_j$
is the average of the control runs from gel $j$. Because
of the exponential amplification, it makes sense to analyze
the response values on the log scale in order to
get at the actual amplification rate. In this context, the
normalization can be regarded as a forced additive day
effect.
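
A small sketch of the normalization follows. The gel labels and intensities are invented for illustration, and the function assumes that the normalization multiplies each response by the ratio of the overall control average to the control average of its own gel.

```python
from statistics import mean

def normalize(responses, controls):
    """Scale each gel's responses by (overall control mean) / (that gel's control mean).

    Both arguments map a gel label to a list of intensities.
    """
    c_all = mean(v for vals in controls.values() for v in vals)
    return {gel: [y * c_all / mean(controls[gel]) for y in ys]
            for gel, ys in responses.items()}

# Hypothetical intensities: gel2 reads twice as bright as gel1 throughout.
controls = {"gel1": [100, 120], "gel2": [200, 240]}
responses = {"gel1": [110, 55], "gel2": [220, 110]}
normalized = normalize(responses, controls)
print(normalized)
```

On the log scale the same operation adds the constant $\log c - \log c_j$ to every response from gel $j$, which is the sense in which the normalization acts as an additive effect.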

3 Nonparametric Regression Models

Nonparametric regression can be thought of as a general
class of methods that provide very flexible approximating
functions. In this section we describe the methods
that were used to model the data from the experiment
described in the previous section.

3.1  Polynomial  Models 

The polynomial regression model can be expressed as
follows

$$Y = p^T(x)\beta + e,$$

where $p$ is a vector of polynomial linear model terms, $\beta$
are the usual linear regression parameters and $e$ is the
random error. An $m$th order polynomial will include
power and cross terms up to order $m$. As $m$ increases,
the flexibility of the polynomial model increases at the
expense of possible overfitting. The lm function in S-PLUS
[7] was used to fit the polynomial models.

3.2  Gaussian  stochastic  processes 

In the Gaussian stochastic process model, we model Y by

$$ Y = p^T(x)\beta + Z(x) + \epsilon, $$

where p is a vector of linear model terms, $\beta$ is the
vector of corresponding (unknown) coefficients, $Z(\cdot)$ is
assumed to be a univariate Gaussian stochastic process on the
design space, and $\epsilon$, representing random measurement
error, is assumed independent of $Z(\cdot)$ and Gaussian with
mean zero. In the model fit in this work, $p^T(x)\beta$ is
simply $\beta_0$. In this case, systematic dependence of Y on
x is captured solely by the $Z(x)$ term.

Flexible specification of $Z(\cdot)$ is key to capturing the
features of complex response surfaces. We consider only
mean zero Gaussian stochastic processes with correlation
functions of the form

$$ R(x, x') = \mathrm{Cor}(Z(x), Z(x')) = \prod_{k=1}^{d} \exp\{-\theta_k |x_k - x'_k|^{p_k}\}. $$

Previous work, Sacks, Welch, Mitchell, and Wynn [1] and Welch,
Buck, Sacks, Wynn, Mitchell, and Morris [2], has found this
structure to be sufficiently flexible to capture quite
complicated response surfaces. McMillan, Sacks, Welch and Gao
[10] also reported the successful use of this method to analyze
a space-filling design. The essential idea behind this
covariance structure is that points “near” each other in the
design space should be more correlated than points “far” from
each other, with the measure of “nearness” being individually
scaled in each dimension of the design space. We note that this
model is a universal kriging model with the covariance
structure specified via R rather than the traditional
variogram. (See Cressie [11] for a review of kriging.) The
correlation function, R, is more general than variograms
typically found in the kriging literature (where d is only
2 or 3) as R does not assume isotropy.
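A small sketch may help make the per-dimension scaling concrete. It evaluates the product correlation function described above, with one scale parameter and one power per dimension; the parameter values and design points are hypothetical and chosen only to show the anisotropy.

```python
import math


def correlation(x, x_prime, theta, p):
    """Product correlation: exp(-sum_k theta[k] * |x[k] - x_prime[k]| ** p[k])."""
    total = 0.0
    for xk, xpk, th, pk in zip(x, x_prime, theta, p):
        total += th * abs(xk - xpk) ** pk
    return math.exp(-total)


def correlation_matrix(points, theta, p):
    """n x n matrix with the correlation of points i and j as the (i, j) element."""
    return [[correlation(a, b, theta, p) for b in points] for a in points]


# Hypothetical values: the response varies quickly along dimension 1
# (large theta) and slowly along dimension 2 (small theta).
theta = [1.0, 0.1]
p = [2.0, 2.0]
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
R = correlation_matrix(points, theta, p)
```

Here a unit step along the first dimension reduces the correlation far more than a unit step along the second, which is the sense in which the structure does not assume isotropy.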

Now operationally, suppose we have n observations of the
system, $(Y_1, x_1), \ldots, (Y_n, x_n)$. Let the vector of
responses, $(Y_1, \ldots, Y_n)^T$, be denoted by Y. The model
we've described for this data can be written in matrix
notation as

$$ Y = P\beta + Z + \epsilon, $$

where P is the expanded design matrix with $p^T(x_i)$ in the
ith row, $Z = (Z(x_1), \ldots, Z(x_n))^T$ is the vector of
stochastic process values at the n experimental settings, and
$\epsilon = (\epsilon_1, \ldots, \epsilon_n)^T$ is the vector
of random errors. We assume $Z \sim N(0, \sigma_z^2 R)$, where
the n x n matrix R has $R(x_i, x_j)$ as the (i, j) element,
$\epsilon \sim N(0, \sigma_\epsilon^2 I)$, and Z and
$\epsilon$ are independent. These assumptions imply
$Y \sim N(P\beta, \sigma^2 C)$, where
$\sigma^2 = \sigma_z^2 + \sigma_\epsilon^2$ and the n x n
correlation matrix C is given by
$(\sigma_z^2 R + \sigma_\epsilon^2 I)/\sigma^2$.

When R, $\sigma_z^2$ and $\sigma_\epsilon^2$ are assumed
known, the best linear unbiased predictor (BLUP) of Y(x) is

$$ \hat{Y}(x) = p^T(x)\hat\beta + c(x)^T C^{-1} (Y - P\hat\beta). $$

Here c(x) is a vector with element i given by
$(\sigma_z^2/\sigma^2) R(x, x_i)$, the correlations between the
Y's at x and the n experimental runs. The vector of
coefficients, $\hat\beta$, is the generalized least squares
estimator, $\hat\beta = (P^T C^{-1} P)^{-1} P^T C^{-1} Y$.

We use the estimator $\hat{Y}$ to make predictions regarding
the response surface for optimization purposes. First, though,
we must handle the difficulty that the parameters of the
covariance structure, $\theta = (\theta_1, \ldots, \theta_d)$,
$p = (p_1, \ldots, p_d)$, $\sigma_z^2$ and
$\sigma_\epsilon^2$, are not known. Available software (ALEX
[9], implemented by Welch) performs maximum likelihood
estimation of these quantities. Estimates of $\theta$, p,
$\sigma_z^2$, and $\sigma_\epsilon^2$, thus obtained, are used
in $\hat{Y}$ for optimization of the predictor.

As one last issue about this model, we evaluate the fit of the
model to the data by an empirical measure of MSE averaged over
the design points,

$$ \mathrm{MSE} = \frac{1}{n - \mathrm{tr}(H)} \sum_{i=1}^{n} \left(y(x_i) - \hat{y}(x_i)\right)^2. $$

The “hat” matrix, H, is defined by

$$ H = (\sigma_z^2/\sigma^2) R C^{-1} + \left(I - (\sigma_z^2/\sigma^2) R C^{-1}\right) P (P^T C^{-1} P)^{-1} P^T C^{-1}. $$

We use the trace of the “hat” matrix as a surrogate for the
degrees of freedom in the model as suggested by Wahba [12] for
splines.

3.3  Thin-plate splines

The mth-order thin-plate spline approximating function f is the
minimizer of the following quantity

$$ \frac{1}{n} \sum_{i=1}^{n} \left(y_i - f(x_i)\right)^2 + \rho J_{m,d}(f) $$

where $\rho > 0$, f has square integrable partial derivatives
up to degree m,

$$ J_{m,d}(f) = \sum_{\alpha_1 + \cdots + \alpha_d = m} \frac{m!}{\alpha_1! \cdots \alpha_d!} \int \left( \frac{\partial^m f}{\partial x_1^{\alpha_1} \cdots \partial x_d^{\alpha_d}} \right)^2 dx $$

and $J_{m,d}(f) < \infty$ (Wahba [12] and Nychka, Ellner,
McCaffrey and Gallant [13]). $J_{m,d}$ is a general (rotation)
invariant measure of the roughness in the function f, and by
varying the value of $\rho$ we can control the smoothness of
the regression surface. The value of $\rho$ is usually chosen
by cross-validation.

The solution to the thin-plate spline minimization problem
will be a linear combination of $\binom{m+d-1}{d}$ monomials
up to degree m - 1 and n radial basis functions. The
coefficients in this linear combination are linear functions
of Y. Therefore, there exists an implicit smoother matrix
$S(\rho)$ such that $\hat{Y} = S(\rho) Y$, where $S(\rho)$
depends on $\rho$, m, d and the $x_i$'s but not on Y. The
effective degrees of freedom for the regression model can be
approximated by $\mathrm{tr}(S(\rho))$ and the mean square
error estimated in a manner similar to that described for the
Gaussian stochastic process model (Nychka [14]). The thin-plate
spline model was fit using a FORTRAN function callable from
S-PLUS (tpsreg [15], implemented by Nychka).
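The role of the smoother trace can be illustrated with any linear smoother. The sketch below is an assumption-laden stand-in: it uses a three-point running mean rather than a thin-plate spline, and all names are hypothetical. It only shows how the trace of the smoother matrix enters the mean square error as effective degrees of freedom.

```python
def running_mean_smoother(n):
    """Smoother matrix of a 3-point running mean (stand-in for a spline smoother)."""
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        neighbours = [j for j in (i - 1, i, i + 1) if 0 <= j < n]
        for j in neighbours:
            S[i][j] = 1.0 / len(neighbours)
    return S


def apply_smoother(S, y):
    """Fitted values: each one is a fixed linear combination of the responses."""
    return [sum(s_ij * y_j for s_ij, y_j in zip(row, y)) for row in S]


def trace(matrix):
    return sum(matrix[i][i] for i in range(len(matrix)))


def mse_effective_df(y, y_hat, df_model):
    """Residual sum of squares divided by n minus the effective degrees of freedom."""
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    return rss / (len(y) - df_model)


y = [1.0, 2.0, 4.0, 3.0, 5.0]
S = running_mean_smoother(len(y))
y_hat = apply_smoother(S, y)
df_model = trace(S)
mse = mse_effective_df(y, y_hat, df_model)
```

The same bookkeeping underlies the model comparison: with 89 responses, a fit that uses 53 effective degrees of freedom leaves 36 for error.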

For the purposes of interpretation, the thin-plate spline
model can be thought of as a limiting case of the Gaussian
stochastic process model. In particular, if we take
$\theta_i = \theta$ and $p_i = 2$ in the Gaussian stochastic
process model, the thin-plate spline model arises as we take
the limit as $\theta \to 0$. The order of the thin-plate
spline is by default m = (d+2)/2 so that the polynomial part
of the model is of degree m.

3.4  Neural  Networks 

Single hidden layer neural networks can be viewed as
nonlinear regression models; namely,

$$ Y = \beta_0 + \sum_{j=1}^{m} \beta_j f\Big(\gamma_{j0} + \sum_{k=1}^{d} \gamma_{jk} x_k\Big) + \epsilon $$

where the function f simulates the on/off firing of a single
neuron. It is important that it is sigmoidal and bounded. We
take f to be the usual squashing function (logistic
distribution function) $f(u) = e^u/(1 + e^u)$, and
$j = 1, \ldots, m$ are units (nodes) in a single hidden layer,
$\gamma_{jk}$ are the input weights for each node, $\beta_j$
are the weights of the hidden units, and $\beta_0$ and
$\gamma_{j0}$ are bias adjustments (constants). The neural
network model was fit using a FORTRAN program callable from
S-PLUS (nnreg [15], implemented by Nychka).

The neural net seems to be a good model for many problems
including nonlinear regression (Cheng and Titterington [16],
Geman, Bienenstock, and Doursat [17] and Nychka, Ellner,
McCaffrey, and Gallant [13]). However, the 1 + m(d + 2)
parameters are estimated by nonlinear least squares and it is
often difficult to find a global minimizer. Dimension reduction
occurs because of the ability to look at linear combinations of
many variables. There is similarity to the method of projection
pursuit in which f is replaced by arbitrary functions.
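The single hidden layer model and its parameter count can be written out directly. This is only a sketch of the forward computation with hypothetical argument names; it does not reproduce the nonlinear least squares fitting done by nnreg.

```python
import math


def logistic(u):
    """Squashing function e^u / (1 + e^u), written to avoid overflow."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)


def nn_predict(x, beta0, beta, gamma0, gamma):
    """beta0 + sum_j beta[j] * f(gamma0[j] + sum_k gamma[j][k] * x[k])."""
    out = beta0
    for b_j, g_j0, g_j in zip(beta, gamma0, gamma):
        out += b_j * logistic(g_j0 + sum(w * xk for w, xk in zip(g_j, x)))
    return out


def n_parameters(m, d):
    """One output bias, plus per hidden unit: d input weights, a bias, an output weight."""
    return 1 + m * (d + 2)


# With all input weights and biases zero, every hidden unit outputs f(0) = 0.5.
pred = nn_predict([3.0, -1.0], beta0=1.0, beta=[2.0, 4.0],
                  gamma0=[0.0, 0.0], gamma=[[0.0, 0.0], [0.0, 0.0]])
count = n_parameters(m=5, d=4)
```

For five hidden units and four inputs the count is 31 parameters, which shows how quickly the nonlinear least squares problem grows with the number of units.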

3.5  Generalized  Additive  Models 

A generalized additive model (GAM) [18] approximates the
regression surface by a function of the following form:

$$ \hat{y} = \beta_0 + \sum_{j=1}^{d} f_j(x_j) $$

where each function $f_j$ is a nonparametric smoothing
function. It is also possible to specify a family for the
error distribution, and we used the standard Gaussian family
in this example. A variety of smoothing functions can be used,
and we used smoothing splines with 4 degrees of freedom, where
the degrees of freedom is equal to tr(S) - 1 and S is the
implicit smoother matrix. Note that it is also possible to
include products of the smoothing functions in the model to
fit a more complex surface. We considered models involving up
to linear-by-linear interactions and quadratic functions of
the smoothers to fit this example. The generalized additive
models were fit using the function gam [19] in S-PLUS [7].

3.6  Multivariate Adaptive Regression Splines

Multivariate adaptive regression splines (MARS) models were
proposed by Friedman [20]. The regression surface is
approximated by a function of the following form:

$$ \hat{y} = \beta_0 + \sum_{j=1}^{m} \beta_j \prod_{l=1}^{n_j} h_{jl}(x_{v(j,l)}) $$

where the $h_{jl}$ are piecewise linear basis functions and
$n_j$ is the number of terms in the jth product. The value of
v(j, l) is an index of the predictor used in the lth term of
the jth product. The basis functions $h_{jl}$ are defined in
pairs:

$$ h_{jl}(x) = [x - t_{jl}]_+ , \qquad h_{j,l+1}(x) = [t_{jl} - x]_+ $$

for l an odd integer, where the knot value $t_{jl}$ is one of
the unique values of $x_{v(j,l)}$. The model is constructed in
a forward stepwise manner followed by pruning of the least
important terms. The degree of the MARS fit specifies the
maximum number of terms allowed in any product and so controls
the level of interactions among predictor variables. MARS takes
advantage of any low order structure in the response surface
and generally adds terms to the model parsimoniously. The MARS
models were fit using the mars function in the fda library
implemented in S-PLUS [7] by Hastie, Tibshirani and Buja [21].
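The paired piecewise linear basis functions are easy to write down. The sketch below evaluates a small MARS-type surface whose knots, coefficients and term structure are hypothetical; it illustrates only the functional form, not the forward stepwise selection or pruning.

```python
def hinge_pair(knot):
    """Return the two mirrored basis functions [x - t]_+ and [t - x]_+."""
    def right(x):
        return max(x - knot, 0.0)

    def left(x):
        return max(knot - x, 0.0)

    return right, left


def mars_predict(x, beta0, terms):
    """terms: list of (coefficient, [(predictor_index, basis_function), ...]).

    Each term is a product of basis functions; the longest product is the
    degree of the fit.
    """
    y = beta0
    for coef, factors in terms:
        prod = 1.0
        for v, h in factors:
            prod *= h(x[v])
        y += coef * prod
    return y


right0, left0 = hinge_pair(2.0)      # knot at 2.0 on predictor 0
right1, _ = hinge_pair(1.0)          # knot at 1.0 on predictor 1
terms = [
    (1.5, [(0, right0)]),                 # main effect, degree 1
    (-0.5, [(0, left0), (1, right1)]),    # two-factor interaction, degree 2
]
y_a = mars_predict((3.0, 4.0), 10.0, terms)
y_b = mars_predict((1.0, 4.0), 10.0, terms)
```

Each basis function is zero on one side of its knot, so a given term only contributes in part of the design space; that locality is what lets the method add structure sparingly.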

4  Model  Fitting  Results 

Each of the nonparametric regression models described in the
previous section was fit to the experimental data. In cases
where there were choices regarding the degree of the model, a
number of alternatives were considered; in particular,
polynomial models of degree 1-4 were fit, neural nets were fit
with 2-5 hidden units, generalized additive models (GAM) were
fit with linear smooths, cross-products of linear smooths and
powers of linear smooths, and multivariate adaptive regression
spline (MARS) models of degree 1-5 were fit.

The  best  model  fitting  results  for  each  method  are 
given  in  Table  1.  The  polynomial,  Gaussian  stochastic 
process,  thin-plate  splines  and  single  hidden  layer  neural 
network  models  each  give  a  satisfactory  fit  to  the  data. 
However,  the  GAM  and  MARS  models  do  not  appear  to 
be  flexible  enough  to  adequately  represent  the  data. 


Table 1: Model Fitting Results

Model        Desc.        DFr   DFe   RMSE   R^2
Polyn.       Degree=4      55    34   0.83   94%
GaSP                       44    45   0.83   93%
TPS          Order=3       53    36   0.83   94%
NNet         Units=5       31    58   0.77   91%
GAM          lin. + 2fi    28    61   1.41   84%
MARS         Degree=3      12    77   1.66   48%
Pure Error                       33   0.84

In order to further compare the four best models on how well
they capture the multidimensional structure of the example, we
ran an optimization routine (to maximize the predicted
response) starting at the conditions of the best design point.
The results are shown in Table 2. Note that the Gaussian
stochastic process model and thin-plate splines give results
which are quite similar to the settings of the best run.
However, the neural net and polynomial models find optimal
settings that are far from the starting point and have
implausible predicted response values.

Table 2: Optimization Results for Best Models

Model       KCl    MgCl2   KPO4   dNTP   y_opt
best run     35     6       20    1000    13.7
GaSP         36     6.2     21     975    13.4
TPS          34     6.1     20     975    13.4
NNet(5)      17     7       21    1500    22.6
Polyn(4)     50     6.8     45     650   454

Further insight into the usefulness of the models can be gained
by examining response surface and contour plots for each of the
four best models. These can be seen in Figures 1-4. The
Gaussian stochastic process surface (Figure 1) and the
thin-plate spline surface (Figure 2) are quite similar.
However, the Gaussian stochastic process surface suggests that
there is a local optimum in addition to the global optimum. The
local optimum is close to the location of the previous best
runs (control settings) and so seems plausible. The neural net
surface (Figure 3) apparently is not sufficiently flexible to
represent the optimum and instead suggests a rising ridge.
(This is why the predicted optimum moved away from the best
run.) Finally, the polynomial model (Figure 4) introduces large
variations in the surface away from the data points in order to
obtain a good fit. The overall surface, consequently, is not
reasonable due to the overfitting.

5  Conclusions 

The  analysis  of  space-filling  designs  is  challenging  be¬ 
cause  of  the  flexibility  required  in  the  approximating 
function.  It  is  important  to  fit  the  observed  data  well 
while  avoiding  overfitting  and  at  the  same  time  provid¬ 
ing  a  reasonable  representation  of  the  multidimensional 
structure  of  the  surface.  In  this  paper  we  used  a  num¬ 
ber  of  very  flexible  models  which  we  loosely  classified  as 
nonparametric  regression  models  to  fit  the  data  from  a 
buffer  optimization  example. 

Due  to  the  wide  range  of  surfaces  which  might  be 
encountered,  it  is  unlikely  that  there  is  one  best  method 
for  the  analysis  of  space-filling  designs.  In  this  example, 
the  methods  of  Gaussian  stochastic  processes,  thin-plate 
splines, single hidden layer neural networks and polynomial
models all provided good fits to the data. Generalized additive
models and multivariate adaptive regression splines did not fit
the data well. A more detailed examination of the
multidimensional structure of the fitted
surfaces  showed  that  only  the  Gaussian  stochastic  pro¬ 
cess  and  thin-plate  spline  models  provided  reasonable 
surfaces. 

In  conclusion,  we  feel  that  the  use  of  space-filling 
designs  provides  a  promising  approach  for  a  wide  class 
of  problems  in  the  pharmaceutical  and  biotechnology 
industries  in  which  it  is  necessary  to  model  complex 
nonlinear  surfaces.  As  we  gain  more  experience  with 
these  designs  and  their  analysis,  we  hope  to  be  able 
to  more  clearly  identify  their  strengths  and  weaknesses, 
offer guidelines for their effective use, and recommend
nonparametric regression methods for modeling the surface
structure.
Figure 1. Response Surface for GaSP Model (axes shown include MgCl2)

Figure 2. Response Surface for Thin-Plate Spline Model

Figure 3. Response Surface for Neural Net Model

Abstract

In this work, we develop a method to estimate the observed
information matrix when using Monte Carlo EM, for a class of
mixed models for partially observed/grouped data. We propose a
Monte Carlo sequel to Louis’ method [3]. Our method includes a
Gibbs step to generate variates from the appropriate densities.
We illustrate the computations involved through two examples.

1  Introduction 

A computational drawback of the EM algorithm is that often the
E step involves hefty, sometimes insurmountable calculations
(e.g., high dimensional integration). For some problems, it may
be feasible to perform these calculations using direct
numerical integration [4], although for more complicated
models, this might not be a computationally tractable option.
Tanner [6] outlined a Monte Carlo EM algorithm, where the idea
is to replace the integrals involved in the E step with a Monte
Carlo estimate. We develop a Monte Carlo sequel to Louis’ [3]
method to estimate the observed information matrix within the
MCEM framework. Although this approach works quite generally,
we have worked out the details for a class of mixed models for
partially observed/grouped data. By partially observed data, we
refer to censored or truncated data; by grouped data we refer
to ordered categorical data. Our method includes a Gibbs step
to generate variates from the appropriate densities. The
computations involved are illustrated through two examples.

In  Section  2,  we  outline  Louis’  method  and  describe  a 
Monte  Carlo  implementation  of  his  method.  In  Section 
3,  we  formulate  the  class  of  mixed  models  of  interest 
and  describe  the  computations  involved.  In  Section  4, 
we  apply  the  methods  developed  in  Section  3  to  probit 
normal  regression  and  censored  regression. 


2  Louis’  Method 

In the usual EM terminology, we define Y to be the
latent/complete data with probability density or mass function
denoted by $[Y \mid \theta]$, where $\theta$ is the unknown
parameter vector and $[\cdot]$ denotes densities. However, we
do not observe Y; instead we observe a measurable function of
Y, namely, $W \sim [W \mid \theta]$. The goal of EM is to find
the maximum likelihood estimate of $\theta$ based on the
observed data W. The EM method is only attractive in situations
where finding the complete data maximum likelihood estimator
and the observed information matrix is straightforward, but the
problem based on the observed data requires an iterative
solution.

Define the set $\mathcal{R} = \{y : w(y) = w\}$, i.e.,
$\mathcal{R}$ is the set of complete data Y that could have led
to the observed data W. Louis [3] proved that the observed
information matrix $I_W(\theta)$ satisfies the following
identity:

$$ I_W(\theta) = E\Big(-\frac{\partial^2}{\partial\theta^2} \ln[Y \mid \theta] \,\Big|\, Y \in \mathcal{R}\Big) - \mathrm{Var}\Big(\frac{\partial}{\partial\theta} \ln[Y \mid \theta] \,\Big|\, Y \in \mathcal{R}\Big) \qquad (1) $$

The first term in $I_W(\theta)$ is simply the conditional
expected information matrix of the complete data Y and is
typically easy to compute. Louis proved that the second term is
the expected information of the conditional distribution of Y
given that Y lies in the set $\mathcal{R}$. In some
applications, it may be computationally intractable to
calculate the expectations in (1). Tanner [6] suggested a Monte
Carlo approach to Louis’ method by replacing the expectations
with a Monte Carlo estimate, in the following way:

1) Generate $y_1, y_2, \ldots, y_m \sim [Y \mid Y \in \mathcal{R}, \theta]$,
with m suitably large.

2) Replace the first term in $I_W(\theta)$ by the corresponding
average over $y_1, \ldots, y_m$, and the second term by the
corresponding sample variance.

We now formulate the model of interest and illustrate the
computations involved.
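As a concrete check of identity (1), the sketch below applies the Monte Carlo version to a deliberately simple case that is not one of the paper's examples: independent N(mu, 1) responses, some of which are known only to exceed a cutoff. Here the complete-data information for mu is n and the score is the sum of (y_i - mu), so the Monte Carlo estimate is n minus the sample variance of the simulated scores. All names and numbers are hypothetical, and the inverse-CDF draw is a simple stand-in for a proper truncated normal generator.

```python
import random
from statistics import NormalDist, variance

STD = NormalDist()  # standard normal


def draw_left_truncated(mu, lower, rng):
    """One draw from N(mu, 1) restricted to (lower, infinity), by inverse CDF."""
    p_low = STD.cdf(lower - mu)
    u = p_low + rng.random() * (1.0 - p_low)
    u = min(max(u, 1e-12), 1.0 - 1e-12)   # keep inv_cdf inside (0, 1)
    return mu + STD.inv_cdf(u)


def mc_louis_information(observed, n_censored, cutoff, mu, m, rng):
    """Monte Carlo estimate of the observed information for mu."""
    n = len(observed) + n_censored
    complete_info = float(n)              # minus second derivative, complete data
    scores = []
    for _ in range(m):
        filled = [draw_left_truncated(mu, cutoff, rng) for _ in range(n_censored)]
        y = list(observed) + filled
        scores.append(sum(yi - mu for yi in y))   # complete-data score
    return complete_info - variance(scores)


observed = [0.2, -0.5, 1.1]
estimate = mc_louis_information(observed, n_censored=2, cutoff=1.5,
                                mu=0.3, m=20000, rng=random.Random(0))

# Closed form for comparison: each censored value contributes 1 - Var(Y | Y > cutoff).
alpha = 1.5 - 0.3
lam = STD.pdf(alpha) / (1.0 - STD.cdf(alpha))
exact = 5 - 2 * (1.0 + alpha * lam - lam ** 2)
```

In this toy problem the answer is available in closed form, which makes it a convenient test bed; the Monte Carlo route matters when, as in the mixed models that follow, the conditional expectations have no simple expression.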

3  The  Model 

We consider the standard analysis of variance model for
variance components estimation:

$$ Y = X\beta + \sum_{k=1}^{r} Z_k u_k + \epsilon \qquad (2) $$

$$ u_k \sim N_{q_k}(0, \sigma_k^2 I) \qquad (3) $$

$$ \epsilon \sim N_n(0, \sigma_0^2 I) \qquad (4) $$

where $Y \in \Re^{n \times 1}$ is the data vector, which is
partially observed or completely unobserved,
$X \in \Re^{n \times p}$ is the design matrix associated with
the unknown fixed effects vector $\beta \in \Re^{p \times 1}$,
and $Z_k \in \Re^{n \times q_k}$ is the incidence matrix
corresponding to the random effects vector $u_k$
$(k = 1, \ldots, r)$. We use the random effects structure as a
convenient way to model the correlation among the components of
Y. The parameters of interest are $\theta = (\beta, \sigma^2)$.

We say a component of Y, $Y_i$, is unobserved if the only data
information available is that it lies in some interval
$(a_i, b_i)$, where $-\infty \le a_i < b_i \le \infty$ and at
least one of $a_i$, $b_i$ is finite. Such applications arise
when “experimental conditions or measuring devices permit
sample points to be trapped only within specified limits” [1],
as in censored or truncated data.

To put this model in the EM framework, we define the vector Y
to be the complete data, since given Y, finding the maximum
likelihood estimates and their standard errors is a normal
linear regression problem, which is easy. We define the set
$\mathcal{R} = \{Y : Y_i = y_i,\ i \in U;\ a_i < Y_i < b_i,\ i \in C\}$,
where C is the set of indices corresponding to the unobserved
components of Y and U that for the observed components of Y.

The complete-data log likelihood is given by:

$$ \ln[Y \mid \theta] \propto -\tfrac{1}{2} \ln|V| - \tfrac{1}{2} (Y - X\beta)' V^{-1} (Y - X\beta) $$

where $V = \sum_{k=0}^{r} \sigma_k^2 Z_k Z_k'$ and $Z_0 = I_n$.

The first term in (1) is a matrix whose components require the
calculation of expectations of the following form:

•  $E(Y - X\beta)$

•  $E((Y - X\beta)' V^{-1} Z_k Z_k' V^{-1} Z_l Z_l' V^{-1} (Y - X\beta))$,
   $k, l = 0, \ldots, r$

where all expectations are conditional on $Y \in \mathcal{R}$.
The second term involves expectations of the following form:

•  $E((Y - X\beta)(Y - X\beta)')$

•  $E((Y - X\beta)' V^{-1} Z_k Z_k' V^{-1} (Y - X\beta))$,
   $k = 0, \ldots, r$

•  $E((Y - X\beta)(Y - X\beta)' V^{-1} Z_k Z_k' V^{-1} (Y - X\beta))$,
   $k = 0, \ldots, r$

•  $E((Y - X\beta)' V^{-1} Z_k Z_k' V^{-1} (Y - X\beta)\,(Y - X\beta)' V^{-1} Z_l Z_l' V^{-1} (Y - X\beta))$,
   $k, l = 0, \ldots, r$

where all expectations are conditional on $Y \in \mathcal{R}$.
So, in order to obtain a Monte Carlo estimate of $I_W(\theta)$,
we need to generate
$y_1, y_2, \ldots, y_m \sim [Y \mid Y \in \mathcal{R}, \theta]$
and then replace the expectations above by sums. It is
interesting to note that we do not need to compute the first
two expectations above separately, since
$E((Y - X\beta)' A (Y - X\beta)) = \mathrm{trace}(A\, E((Y - X\beta)(Y - X\beta)'))$
for any matrix A.

The density $[Y \mid Y \in \mathcal{R}, \theta]$ is not trivial
to generate from, since it is the density of a multivariate
normal constrained to lie within a certain set $\mathcal{R}$.
We propose the use of the Gibbs sampler to generate variates
from this distribution.

3.1  The  Gibbs  Sampler 

We now outline the use of the Gibbs sampler. In order to
generate a sample of Y's from the conditional distribution
$[Y \mid Y \in \mathcal{R}, \theta]$, we only need to generate
the unobserved components from their full conditional
distributions:

$$ [Y_i,\ i \in C \mid Y_j,\ j \ne i] $$

which is a univariate truncated normal distribution, using
standard results on normal theory. More formally, we have:

Step 0) Obtain starting values for $Y_i$, $i \in C$.

Step 1) For each $i \in C$, calculate

$$ \sigma^2_{i \mid (i)} = \mathrm{Var}(Y_i \mid Y_j = y_j,\ j \ne i) $$

and the covariance $\rho_{i \mid (i)} = \mathrm{cov}(Y_i, Y_{(i)})$,
where $Y_{(i)} = (Y_1, Y_2, \ldots, Y_{i-1}, Y_{i+1}, \ldots, Y_n)'$.

Step 2) For each $i \in C$, calculate

$$ \mu_{i \mid (i)} = E(Y_i \mid Y_j = y_j,\ j \ne i) = x_i \beta + \rho_{i \mid (i)}' V_{(i)}^{-1} (y_{(i)} - X_{(i)} \beta) $$

where $X_{(i)}$ = X with row i deleted, $x_i$ is the ith row
of X, and $V_{(i)}$ is the covariance matrix of $Y_{(i)}$.

Step 3) Simulate $Y_i$, $i \in C$, from a truncated normal
distribution with mean $\mu_{i \mid (i)}$ and standard
deviation $\sigma_{i \mid (i)}$, truncated between
$(a_i, b_i)$.

Repeat Steps 2 and 3 a large number of times, NREP, to obtain a
sequence of draws. Discard a suitable number, NBURN, of the
draws from the beginning of the sequence and then retain every
NSKIPth one. Of course, we only need to run the Gibbs sequence
one time to generate a sample from
$[Y \mid Y \in \mathcal{R}, \theta]$. The advantages of this
Gibbs sampling approach are two-fold. Firstly, we only ever
need to generate variates from univariate truncated normal
distributions, and fast acceptance-rejection algorithms exist
to generate from truncated distributions [5]. Secondly, most of
the computational effort is expended in repeating Steps 2 and 3
a large number of times. Thus, complicated random effects
structures have little impact on the computational time,
because they only affect Step 1. We verify our results on two
data sets to illustrate the feasibility of the computations.
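The steps above can be sketched as follows. Two simplifications are assumptions of this sketch rather than the procedure in the text: the conditional means and variances are computed from the precision matrix Q (the inverse of V), using the standard result that the conditional variance of component i is 1/Q_ii and its conditional mean is mu_i minus the sum over j of Q_ij (y_j - mu_j), divided by Q_ii; and truncated draws use a plain inverse-CDF method instead of the acceptance-rejection algorithms of [5]. All names and inputs are hypothetical.

```python
import random
from statistics import NormalDist

STD = NormalDist()  # standard normal


def draw_truncated_normal(mean, sd, lower, upper, rng):
    """Inverse-CDF draw from N(mean, sd^2) restricted to (lower, upper).

    Simple stand-in; it loses accuracy far out in the tails.
    """
    p_lo = STD.cdf((lower - mean) / sd)
    p_hi = STD.cdf((upper - mean) / sd)
    u = p_lo + rng.random() * (p_hi - p_lo)
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    draw = mean + sd * STD.inv_cdf(u)
    return min(max(draw, lower), upper)


def gibbs_truncated_normal(mu, precision, bounds, start, n_rep, n_burn, n_skip, rng):
    """Gibbs sampler for a multivariate normal constrained component-wise.

    bounds[i] is (a_i, b_i) for an unobserved component, or None when the
    component is observed and stays fixed at start[i].
    """
    y = list(start)
    n = len(y)
    kept = []
    for rep in range(n_rep):
        for i in range(n):
            if bounds[i] is None:
                continue
            cond_var = 1.0 / precision[i][i]
            shift = sum(precision[i][j] * (y[j] - mu[j]) for j in range(n) if j != i)
            cond_mean = mu[i] - cond_var * shift
            a, b = bounds[i]
            y[i] = draw_truncated_normal(cond_mean, cond_var ** 0.5, a, b, rng)
        # Drop the burn-in draws, then keep every n_skip-th one.
        if rep >= n_burn and (rep - n_burn) % n_skip == 0:
            kept.append(list(y))
    return kept


inf = float("inf")
# Two independent latent values with the signs fixed, as in a probit-type model.
sample = gibbs_truncated_normal(
    mu=[0.0, 0.0], precision=[[1.0, 0.0], [0.0, 1.0]],
    bounds=[(0.0, inf), (-inf, 0.0)], start=[0.5, -0.5],
    n_rep=6000, n_burn=1000, n_skip=5, rng=random.Random(0))
mean_first = sum(row[0] for row in sample) / len(sample)

# Correlated pair (variances 1, covariance 0.5): first value observed at 2.0,
# second known only to exceed 1.0.
q = [[4.0 / 3.0, -2.0 / 3.0], [-2.0 / 3.0, 4.0 / 3.0]]
censored = gibbs_truncated_normal(
    mu=[0.0, 0.0], precision=q, bounds=[None, (1.0, inf)], start=[2.0, 1.5],
    n_rep=500, n_burn=100, n_skip=2, rng=random.Random(1))
```

Working with the precision matrix means a single matrix inversion up front, after which each update costs one pass over a row, in line with the remark that the repeated updates dominate the computation.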

4  Examples 

4.1  Probit  Normal  Regression 

We consider a latent variable genesis of the probit normal
model for binary data by postulating the existence of an
underlying/latent variable Y. We assume that Y satisfies the
linear mixed model in (2-4), with the error variance
$\sigma_0^2 = 1$, without loss of generality [2]. We observe a
binary variable $W_i = I(Y_i > 0)$; i.e., an indicator of
whether $Y_i$ crosses a threshold of 0. An example of a
situation where such a threshold model might be appropriate is
with regard to the financial health of a firm. The observed
variable is an indicator of whether the firm is bankrupt (1/0),
while the underlying variable represents the true health of the
firm. It is unimportant whether we actually believe in the
underlying variable, or merely use it as a device to estimate
the parameters in the model. The advantage of this threshold
model is that it automatically lends itself to a data
augmentation approach such as the EM algorithm.

It is easy to see that $\mathcal{R}$ is simply the intersection
of n half-lines; if $W_i = 1$, then we consider the half-line
$[0, \infty)$ while if $W_i = 0$, we consider $(-\infty, 0]$.
Thus, in Step 3) of the Gibbs sampler, we generate $Y_i$ from a
normal distribution, truncated to lie above 0 if $W_i = 1$ and
truncated to lie below 0 if $W_i = 0$. We numerically verified
our results on the Weil data set [7]. This data set has a
treatment and control group and a single nested random effect.
The response is survival status of rats and the random effect
is litter. The observed data are binary, indicating
survival/death, and we assume they arise from a true underlying
variable in the following way: $W_{ijk} = I(Y_{ijk} > 0)$ where

$$ Y_{ijk} = \beta_i + u_{ij} + \epsilon_{ijk} $$

$$ u_{ij} \sim N(0, \sigma_i^2) $$

$$ \epsilon_{ijk} \sim N(0, 1) $$

where i indexes treatment/control, j indexes litter and k
indexes the rat within the litter. So, $\beta_i$ is the group
mean on the latent scale and the $u_{ij}$ are the random litter
effects. The following table shows the estimates of the
standard errors of the maximum likelihood estimates obtained by
numerical integration (Gaussian quadrature with 20 points) and
our approach.


                          SE (MLE)
Group       Parameter     Numerical   MC Louis
Treatment   $\beta_1$     0.309       0.304 (0.002)
            $\sigma_1$    0.291       0.297 (0.008)
Control     $\beta_2$     0.169       0.167 (0.007)
            $\sigma_2$    0.301       0.302 (0.028)

The Monte Carlo estimate is the average of 35 independent runs
and each run is based on a Gibbs sample of size 1500. The
numbers in parentheses are the standard errors of the Monte
Carlo estimate. We can see that our estimates agree
substantially with those obtained by numerical integration.

4.2  Censored  Regression 

We  consider  the  case  where  some  of  the  Y  are  right  cen¬ 
sored.  This  can  occur  when  the  response  is  a  waiting 
time  and  a  typical  member  of  the  population  of  physical 
or  biological  units  is  observed  till  an  event  of  interest 
(or  censoring)  occurs.  Such  data  arise  in  medical  appli¬ 
cations  (time  till  the  first  tumor),  reliability  (repairable 
systems  and  software  reliability)  or  labor  economics  (pe¬ 
riod  of  successive  layoffs). 
The observed data are the pairs
$(\min(Y_i, a_i),\ I(Y_i < a_i))$, $i = 1, \ldots, n$. The
response vector Y is assumed to satisfy the mixed model in
(2-4). To put this model in the EM framework, we define Y to be
the complete data. It is easy to see that
$\mathcal{R} = \{Y : Y_i = y_i,\ i \in U;\ Y_i > a_i,\ i \in C\}$,
where U is the set of indices of uncensored observations and C
that for censored observations. Again, in Step 3) of the Gibbs
sampler, we simply generate the censored $Y_i$ from a normal
distribution, truncated to lie above $a_i$. We applied our
method to a matched pairs skin graft data set analyzed by
Petitt [4]. These data concern the survival of closely and
poorly matched skin grafts on the same person. The model
postulated for the logarithm of the survival time on the ith
subject, denoted by $Y_{ij}$, is:

$$ Y_{ij} = \mu + p_i + \gamma g_{ij} + \epsilon_{ij} $$

$$ p_i \sim N(0, \sigma_p^2) $$

$$ \epsilon_{ij} \sim N(0, \sigma^2) $$

where $p_i$ is a single nested individual effect, $\mu$ is the
overall mean, $\gamma$ is a fixed regression parameter and
$g_{ij}$ is an indicator variable (-1 for a poor match and +1
for a good match). There were 2 censored observations in this
data set. We compared our results on the standard errors of the
fixed effects parameters with those obtained by Petitt, and
they are displayed below.


              SE (MLE)
Parameter     Petitt    MC Louis
$\mu$         0.15      0.149 (5.027e-05)
$\gamma$      0.082     0.086 (6.297e-05)

References 

[1]  Dempster,  A.  P.,  Laird,  N.  M.  and  Rubin,  D, 
B.  (1977)  “Maximum  Likelihood  Estimation  from 
Incomplete  Data  via  the  E  M  algorithm”,  J.  R. 
Siaiisi,  Soc,  B,  Vol  39,  1-38. 

[2]  Harville  D.  A.,  Mee  R.  W.  (1984)  “A  Mixed- 
Model  Procedure  for  Analyzing  Ordered  Categor¬ 
ical  Data”,  Biomeiricsy  Vol  40,  393-408. 

[3]  Louis,  T.  A.  (1982)  “Finding  the  Observed  Infor¬ 
mation  Matrix  when  Using  the  E  M  Algorithm”,  J. 
R.  Statist  Soc.  By  Vol  44,  226-233. 

[4]  Petitt,  A.  N.  (1986)  “Censored  Observations,  Re¬ 
peated  Measures  and  Mixed  Effects  Models:  An 
Approach  using  the  E  M  Algorithm  and  Normal  Er¬ 
rors”,  Biometrikay  Vol  73,  635-643. 

[5]  Robert  C.  P.  (1991)  “Simulation  of  Truncated  Nor¬ 
mal  Variables”,  Technical  Report  No.  161,  LSTA, 
University  of  Paris  6. 

[6]  Tanner  M.  A.  (1993)  “Tools  for  Statistical  Infer¬ 
ence”,  Second  Edition,  Springer- Verlag,  NY,  1993. 

[7]  Weil,  C.  S.  (1970)  “Selection  of  the  Valid  Number  of 
Sampling  Units  and  Consideration  of  their  Combi¬ 
nation  in  Toxicological  Studies  Involving  Reproduc¬ 
tion,  Teratogenesis,  or  Carcinogenesis”,  Food  and 
Cosmetic  Toxicologyy  Vol  8,  177-182. 


The Monte Carlo estimate is the average of 50 independent
runs, and each run is based on a Gibbs sample
of size 2000.


5  Conclusion 

In this paper, we develop a method to estimate the standard
errors of the maximum likelihood estimates for a
class of mixed models for incomplete data. Our approach
is a valuable contribution to the existing literature on
likelihood inference, since we are now able to make inferential
statements in situations where it may not even
be possible to compute the likelihood function with any
reasonable degree of precision. In addition to the examples
discussed here, we have implemented our method for
the ordinal probit model and Tobit regression and obtained
satisfactory results.

ABSTRACT

The  state-space  representation  of  a  regression  model 
for  longitudinal  data  simplifies  the  handling  of  missing 
data  and  measurement  error.  In  this  model,  a  continuous 
response  depends  on  the  lagged  response  and  both  time- 
dependent  and  time-independent  covariates.  The  baseline 
response  depends  only  on  covariates.  Both  the  EM 
algorithm  and  Gibbs  sampling  are  used  to  fit  the  model.  In 
EM,  the  E-Step  uses  the  Kalman  filter  and  associated 
filtering  algorithms  to  update  the  unknown  true  response 
and  predictor  series  for  the  observed  data.  The  M-Step  uses 
standard  closed-form  expressions  for  Gaussian  data.  Gibbs 
sampling  offers  a  straightforward  way  to  compute  Bayesian 
answers  and  some  extensions  to  the  model. 

1.  INTRODUCTION 

Longitudinal data, common in many scientific fields,
consist of a collection of short time series taken on
different units or individuals. Often, this collection of time
series may be related to other collections of short series by
regression.  The  regression  model  defines  the  conditional 
multivariate  mean  and  covariance  structure  of  the 
collection  of  response  series  given  the  predictor  series  and 
possibly  fixed  covariates.  Since  observations  taken  on  the 
same  individual  at  different  times  are  usually  correlated, 
the  covariance  structure  can  be  quite  complex. 

First-order  autoregressive,  AR(1),  models  are  attractive 
for  longitudinal  models  with  short  series  since  they  require 
only  one  correlation  parameter  and  possess  geometric  rates 
of  decay.  We  shall  consider  a  form  in  which  the  response 
itself  takes  an  autoregressive  form  as  a  function  of  the 
lagged  response  and  both  time-varying  and  time-invariant 
covariates.  This  form  allows  a  time-varying  covariate  to 
affect  all  responses  during  and  after  the  time  at  which  the 
covariate is measured.
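The persistence of a time-varying covariate can be illustrated with a small noise-free recursion. The sketch below assumes a scalar model $y_t = \alpha + \gamma y_{t-1} + \beta x_t$ with made-up parameter values (none of them come from the paper) and checks that a one-unit change in the covariate at one visit shifts the response $k$ visits later by $\beta\gamma^k$, the geometric decay mentioned above.

```python
def simulate(alpha, gamma, beta, x, y0):
    """Iterate y_t = alpha + gamma * y_{t-1} + beta * x_t without noise.

    x[0] is the covariate at visit 1; the returned list starts with y0.
    """
    y = [y0]
    for x_t in x:
        y.append(alpha + gamma * y[-1] + beta * x_t)
    return y


# Illustrative values only (assumptions, not estimates from the paper).
alpha, gamma, beta = 0.5, 0.6, 2.0
T = 6

baseline = simulate(alpha, gamma, beta, [0.0] * T, y0=1.0)

# Raise the covariate by one unit at visit 2 only.
x_shock = [0.0] * T
x_shock[1] = 1.0
shocked = simulate(alpha, gamma, beta, x_shock, y0=1.0)

# effects[t] is the change in the response at visit t caused by the shock.
effects = [s - b for s, b in zip(shocked, baseline)]
for t, e in enumerate(effects):
    print(f"visit {t}: effect {e:.4f}")
```

Visits before the shock are unaffected, and the effect at visit 2 + k equals `beta * gamma ** k`.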

This  model  quite  naturally  generalizes  the 
autoregressive  time  series  model  to  include  terms  for  trend 
which  describe  a  regression  on  covariates.  It  has  been 
called  by  a  variety  of  different  names  in  the  literature 
including  the  state  dependence  model  (Anderson  and 
Hsiao, 1982), the conditional autoregressive model (Rosner
and Munoz, 1992) and the transition (Markov) model
(Zeger and Liang, 1992). Interpretation of the model
parameters requires some care since the lagged response
appears on the right side of the regression equation
(Schmid, 1994).

Schmid,  Segal,  and  Rosner  (1994)  showed  how  to 
calculate  maximum  likelihood  estimates  for  this  model 
using  a  Newton-Raphson  algorithm  when  the  response  and 
the  possibly  time-varying  covariates  were  subject  to 
quantifiable  random  Gaussian  measurement  error.  This 
implementation  had  several  limitations,  however.  It 
involved  complex  analytic  derivative  calculations,  could  not 
handle  missing  values,  did  not  provide  any  intuition  into 
the  sequential  nature  of  the  longitudinal  data,  and  finally 
could  not  be  easily  extended  to  more  general  models. 

A  state-space  representation  of  the  longitudinal  model 
treating  the  unobserved  true  series  as  missing  data  helps  to 
rectify  these  problems.  Considering  the  longitudinal 
regression  model  as  a  time  series  model  with  a  trend 
component,  this  state-space  representation  can  be  thought 
of  as  a  generalization  of  the  one  proposed  by  Shumway  and 
Stoffer  (1982)  for  time  series.  The  state  equation  describes 
the  regression  model  of  the  true  series  and  the  observation 
equation describes the measurement error in observing
these  series.  Any  truly  missing  observations  in  any  of  the 
observed  series  can  be  adjusted  for  in  the  observation 
equation.  This  representation  leads  to  straightforward 
sequential  computation  by  both  the  EM  algorithm  and 
Gibbs  sampling  allowing  for  Gaussian  missing  data  and 
measurement  error  models.  These  iterative  numerical 
algorithms  can  be  easily  extended  to  other  models.  Section 
2  sets  forth  the  state-space  model.  Section  3  describes  the 
EM  algorithm  and  Section  4  lays  out  the  Gibbs  sampler  and 
some  potential  extensions  of  the  model.  Section  5  presents 
an application to the measurement of pulmonary function.

2.  THE  STATE-SPACE  MODEL 

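A general conditional first-order autoregressive model
for the outcome, $y_{it}$, of the ith individual at the tth time
incorporating measurement error can be written

$$y_{it} = \alpha + \gamma y_{i,t-1} + \beta^T x_{it} + \delta^T z_i + \xi^T s_{it} + \epsilon_{it} \qquad (1)$$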

where i indexes the N individuals, t indexes the T distinct
times of measurement, and the $\epsilon_{it}$ are random errors
independent for all times and individuals with a common
$N(0, \sigma^2)$ distribution. Here $x_{it}$ is a vector of $N_x$ time-


126  A  State-Space  Model  for  Longitudinal  Regression 


dependent and time-independent covariates measured
without error; $z_i$ is a vector of $N_z$ time-independent
covariates measured with error; $s_{it}$ is a vector of $N_s$ time-
dependent covariates measured with error at time t. The
model parameters are $\alpha$, $\gamma$, $\sigma^2$, $\beta$, $\delta$ and $\xi$. The first three
are scalars and the last three are vectors of length $N_x$, $N_z$
and $N_s$, respectively.

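At the baseline visit, the outcome for the ith individual,
$y_{i0}$, is solely a function of the covariates at that visit given
by

$$y_{i0} = \alpha_0 + \beta_0^T x_{i0} + \delta_0^T z_i + \xi_0^T s_{i0} + \epsilon_{i0} \qquad (2)$$

The baseline regression parameters are $\alpha_0$, $\sigma_0^2$, $\beta_0$, $\delta_0$ and
$\xi_0$ with the last three vectors of length $N_x$, $N_z$ and $N_s$,
respectively. The $\epsilon_{i0}$ are random errors independent of
each other and of the $\epsilon_{it}$ for t > 0 and follow a $N(0, \sigma_0^2)$
distribution.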

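Because of the measurement error and missing data,
the covariates $z_i$ and $s_{it}$ are also stochastic quantities that
we model as

$$s_{it} = \alpha_s + \gamma_s s_{i,t-1} + \beta_s x_{it} + \delta_s z_i + \epsilon_{sit} \qquad (3)$$

$$s_{i0} = \alpha_{s0} + \beta_{s0} x_{i0} + \delta_{s0} z_i + \epsilon_{si0} \qquad (4)$$

$$z_i = \alpha_z + \beta_z x_{i0} + \epsilon_{zi} \qquad (5)$$

with independent errors $\epsilon_{sit} \sim N(0, \Sigma_s)$, $\epsilon_{si0} \sim N(0, \Sigma_{s0})$
and $\epsilon_{zi} \sim N(0, \Sigma_z)$. In (3) - (5), the regression parameters
(with dimensions) are $\alpha_s$ ($N_s \times 1$), $\gamma_s$ ($N_s \times N_s$), $\beta_s$ ($N_s \times N_x$),
$\delta_s$ ($N_s \times N_z$), $\alpha_{s0}$ ($N_s \times 1$), $\beta_{s0}$ ($N_s \times N_x$), $\delta_{s0}$ ($N_s \times N_z$),
$\alpha_z$ ($N_z \times 1$), $\beta_z$ ($N_z \times N_x$), $\Sigma_s$ ($N_s \times N_s$), $\Sigma_{s0}$ ($N_s \times N_s$)
and $\Sigma_z$ ($N_z \times N_z$). Together, equations (1) - (5) may be
combined into the state equations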

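$$p_{it} = F p_{i,t-1} + G d_{it} + Q e_{it}, \quad t = 1, 2, \ldots, T \qquad (6)$$

and

$$p_{i0} = G_0 d_{i0} + Q_0 e_{i0}, \quad t = 0 \qquad (7)$$

with $p_{it}^T = (y_{it}, s_{it}^T, z_i^T)$, $d_{it}^T = (1, x_{it}^T)$, $e_{it}^T = (\epsilon_{it}, \epsilon_{sit}^T) \sim N(0, \Sigma)$
and $e_{i0}^T = (\epsilon_{i0}, \epsilon_{si0}^T, \epsilon_{zi}^T) \sim N(0, \Sigma_0)$. $F$ ($(1+N_s+N_z) \times (1+N_s+N_z)$),
$G$ ($(1+N_s+N_z) \times (1+N_x)$), $Q$ ($(1+N_s+N_z) \times (1+N_s)$),
$G_0$ ($(1+N_s+N_z) \times (1+N_x)$) and $Q_0$ ($(1+N_s+N_z) \times (1+N_s+N_z)$)
are transition matrices derived from equations
(1) - (5). It is worth emphasizing that equations (1) - (5)
describe an autoregressive process based on true rather than
observed data.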
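As an illustration of how transition matrices of this kind can be assembled, the following sketch treats the scalar case $N_s = N_z = N_x = 1$. It assumes the response equation contains the contemporaneous covariate $s_{it}$, substitutes the covariate equation into it so that only lagged states appear on the right-hand side, and checks one step numerically. All parameter values and helper names (`step_direct`, `step_state`) are hypothetical, and the paper's own construction of $F$, $G$ and $Q$ may differ in detail.

```python
import numpy as np

# Hypothetical scalar parameters; the values are illustrative only.
alpha, gamma, beta, delta, xi = 0.2, 0.7, 0.5, -0.3, 0.4   # response equation
alpha_s, gamma_s, beta_s, delta_s = 0.1, 0.5, 0.3, 0.2     # covariate equation

# State p_t = (y_t, s_t, z). Substituting the equation for s_t into the
# equation for y_t leaves only lagged states on the right-hand side.
F = np.array([[gamma, xi * gamma_s, delta + xi * delta_s],
              [0.0,   gamma_s,      delta_s],
              [0.0,   0.0,          1.0]])
G = np.array([[alpha + xi * alpha_s, beta + xi * beta_s],
              [alpha_s,              beta_s],
              [0.0,                  0.0]])
Q = np.array([[1.0, xi],
              [0.0, 1.0],
              [0.0, 0.0]])


def step_direct(y_prev, s_prev, z, x, e_y, e_s):
    """Evaluate the covariate and response equations one after the other."""
    s = alpha_s + gamma_s * s_prev + beta_s * x + delta_s * z + e_s
    y = alpha + gamma * y_prev + beta * x + delta * z + xi * s + e_y
    return np.array([y, s, z])


def step_state(p_prev, x, e):
    """Evaluate the stacked state equation p_t = F p_{t-1} + G d_t + Q e_t."""
    d = np.array([1.0, x])
    return F @ p_prev + G @ d + Q @ e


p_prev = np.array([1.5, -0.4, 0.8])
x_t = 2.0
e_t = np.array([0.05, -0.02])

direct = step_direct(*p_prev, x_t, *e_t)
state = step_state(p_prev, x_t, e_t)
print(direct, state)
```

Both routes give the same next state, and the last row of each matrix simply carries the time-independent covariate forward unchanged.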

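To complete the state-space model, we relate the
observed series $p^*_{it}$ to the unobserved series $p_{it}$ by the
observation equations

$$p^*_{it} = \Lambda p_{it} + \Phi d_{it} + \Omega_{it}, \quad t = 1, 2, \ldots, T \qquad (8)$$

and

$$p^*_{i0} = \Lambda_0 p_{i0} + \Phi_0 d_{i0} + \Omega_{i0}, \quad t = 0 \qquad (9)$$

where $p^{*T}_{it} = (y^*_{it}, s^{*T}_{it})$, $p^{*T}_{i0} = (y^*_{i0}, s^{*T}_{i0}, z^{*T}_i)$, $\Omega_{it} \sim N(0, \Sigma_\Omega)$ and
$\Omega_{i0} \sim N(0, \Sigma_{\Omega 0})$. $\Lambda$ ($(1+N_s) \times (1+N_s+N_z)$), $\Phi$ ($(1+N_s) \times (1+N_x)$),
$\Lambda_0$ ($(1+N_s+N_z) \times (1+N_s+N_z)$) and $\Phi_0$
($(1+N_s+N_z) \times (1+N_x)$) are the transition matrices relating
the observed and true series.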

The observation equations (8) and (9) describe a
systematic measurement error model with the observed
values related to the true values by a linear regression. The
regression errors $\Omega_{it}$ may be correlated across covariates
for the same individual, but are independent across
individuals. The random measurement error model is a
special case with all elements of the transition matrices
zero except the diagonal elements of $\Lambda$ and $\Lambda_0$.

We shall assume that the transition matrices and
covariance matrices $\Sigma_\Omega$ and $\Sigma_{\Omega 0}$ in the observation
equations are known. Otherwise, they may be estimated if
multiple observed data series are available. When multiple
measurements are unavailable, we can investigate the effect
of different measurement error models through sensitivity
analyses or by incorporating information from some
external data source, possibly by averaging over a given
measurement error distribution (Schmid and Rosner, 1993).

3.  FITTING  BY  THE  EM  ALGORITHM 

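The EM algorithm maximizes the expected complete-
data likelihood where the expectation is taken with respect
to the distribution of the missing data. In this problem, the
complete data consist of the observed series $p^*_{it}$ and $d_{it}$ and
the unobserved true series $p_{it}$. To simplify notation for
writing the Gaussian complete-data log likelihood, express
the right-hand sides of equations (1) - (5) as, respectively,
$\theta_y^T H_{yit}$, $\theta_{y0}^T H_{yi0}$, $\theta_s H_{sit}$, $\theta_{s0} H_{si0}$ and $\theta_z H_{zi}$. In these
expressions, the $\theta$'s represent the model parameters and the
$H$'s represent the model covariates. Twice the negative log
likelihood is then written

$$
\begin{aligned}
& N\left(\log\sigma_0^2 + T\log\sigma^2 + \log|\Sigma_{s0}| + T\log|\Sigma_s| + \log|\Sigma_z|\right) \\
& + \sum_{i=1}^{N} (y_{i0} - \theta_{y0}^T H_{yi0})^2 / \sigma_0^2 \\
& + \sum_{i=1}^{N} (s_{i0} - \theta_{s0} H_{si0})^T \Sigma_{s0}^{-1} (s_{i0} - \theta_{s0} H_{si0}) \\
& + \sum_{i=1}^{N} (z_i - \theta_z H_{zi})^T \Sigma_z^{-1} (z_i - \theta_z H_{zi}) \\
& + \sum_{i=1}^{N} \sum_{t=1}^{T} (y_{it} - \theta_y^T H_{yit})^2 / \sigma^2 \\
& + \sum_{i=1}^{N} \sum_{t=1}^{T} (s_{it} - \theta_s H_{sit})^T \Sigma_s^{-1} (s_{it} - \theta_s H_{sit}) \\
& + \sum_{i=1}^{N} (p^*_{i0} - \Lambda_0 p_{i0} - \Phi_0 d_{i0})^T \Sigma_{\Omega 0}^{-1} (p^*_{i0} - \Lambda_0 p_{i0} - \Phi_0 d_{i0}) \\
& + \sum_{i=1}^{N} \sum_{t=1}^{T} (p^*_{it} - \Lambda p_{it} - \Phi d_{it})^T \Sigma_\Omega^{-1} (p^*_{it} - \Lambda p_{it} - \Phi d_{it}) \qquad (10)
\end{aligned}
$$

Assuming known parameters in the observation equation,
the parts of the log likelihood above coming from the
observation equation are absorbed into the constant, so that
the expectation of the log likelihood with respect to the
missing data distribution is proportional to the first six
lines of (10).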

The sufficient statistics when the expectation of this
complete-data log likelihood is taken with respect to the
missing data distribution are then the conditional first and
second moments of the state vector $p_{it}$ given the observed
data. For example, $E(p_{i0} \mid p^*)$ represents the conditional
expectation of the state vector at time 0 given all the
observed data (i.e., through time T). Schmid (1994)
provides the details of these expressions.

The E-Step calculates the conditional means, variances
and covariances of $p_{it}$ given the observed data involved in
these sufficient statistics by applying the Kalman filter
(Brown and Schmid, 1994; Meinhold and Singpurwalla,
1983), fixed interval smoothing algorithm (Ansley and
Kohn, 1982) and state-space covariance algorithm (DeJong
and MacKinnon, 1988) to each individual in the study.
First, the Kalman filter sequentially computes the
conditional moments of each $p_{it}$ given the observed data
through time t, e.g., $E(p_{it|t})$, as

$$E(p_{it|t-1}) = F E(p_{i,t-1|t-1}) + G d_{it}$$

$$V(p_{it|t-1}) = F V(p_{i,t-1|t-1}) F^T + Q V(e_{it}) Q^T$$

$$K_t = V(p_{it|t-1}) \Lambda^T [\Lambda V(p_{it|t-1}) \Lambda^T + \Sigma_\Omega]^{-}$$

$$E(p_{it|t}) = E(p_{it|t-1}) + K_t [p^*_{it} - \Lambda E(p_{it|t-1}) - \Phi d_{it}]$$

$$V(p_{it|t}) = V(p_{it|t-1}) - K_t \Lambda V(p_{it|t-1})$$

where $V(e_{it})$ is a block diagonal matrix having elements
$\sigma^2$ and $\Sigma_s$. The generalized inverse is necessary in the
computation of $K_t$ because rows and columns of the matrix
corresponding to $z_i$ will be all zeroes for t > 0. To initialize
the filter, set $E(p_{i0}) = G_0 d_{i0}$ and $V(p_{i0}) = Q_0 V(e_{i0}) Q_0^T$
with $V(e_{i0})$ a block diagonal matrix having elements
$\sigma_0^2$, $\Sigma_{s0}$ and $\Sigma_z$. The forward step gives the correct
expectation and covariance for $p_{iT}$, but the estimated
moments for $p_{it}$ for t < T are incompletely updated, using
only data up to time t.
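The forward recursions can be sketched in a few lines of NumPy. This is a minimal illustration for a single individual, assuming the standard Kalman predict and update steps with a Moore-Penrose generalized inverse from `numpy.linalg.pinv`; the function name `kalman_forward`, its argument names and the toy local-level example are assumptions made for this sketch, not the paper's implementation.

```python
import numpy as np


def kalman_forward(obs, d, F, G, QVQ, Lam, Phi, Sigma_obs, m0, V0):
    """Forward pass for t = 1..T of the state-space model

        p_t  = F p_{t-1} + G d_t + Q e_t,     Var(Q e_t) = QVQ
        p*_t = Lam p_t + Phi d_t + omega_t,   Var(omega_t) = Sigma_obs

    m0, V0 are the filtered mean and covariance at time 0.
    Returns lists of predicted and filtered means and covariances.
    """
    m, V = m0, V0
    pred_m, pred_V, filt_m, filt_V = [], [], [], []
    for p_star, d_t in zip(obs, d):
        # Prediction: moments of p_t given data through t-1.
        m_pred = F @ m + G @ d_t
        V_pred = F @ V @ F.T + QVQ
        # Gain, using a generalized inverse of the innovation covariance.
        S = Lam @ V_pred @ Lam.T + Sigma_obs
        K = V_pred @ Lam.T @ np.linalg.pinv(S)
        # Update: moments of p_t given data through t.
        m = m_pred + K @ (p_star - Lam @ m_pred - Phi @ d_t)
        V = V_pred - K @ Lam @ V_pred
        pred_m.append(m_pred)
        pred_V.append(V_pred)
        filt_m.append(m)
        filt_V.append(V)
    return pred_m, pred_V, filt_m, filt_V


# Toy scalar example: random walk state observed with unit-variance noise.
one, zero = np.array([[1.0]]), np.array([[0.0]])
pred_m, pred_V, filt_m, filt_V = kalman_forward(
    obs=[np.array([2.0])], d=[np.array([1.0])],
    F=one, G=zero, QVQ=one, Lam=one, Phi=zero, Sigma_obs=one,
    m0=np.array([0.0]), V0=one)
print(filt_m[0], filt_V[0])
```

In the toy example the predicted variance is 2, the gain is 2/3, and the filtered mean and variance are 4/3 and 2/3.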

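To complete the E-Step, work backward with the fixed
interval smoothing and covariance algorithms from time T
to time 0, updating the moments of $p_{it}$ for $p^*_{iu}$ and $d_{iu}$
when u > t by

$$E(p_{i,t-1|T}) = E(p_{i,t-1|t-1}) + J_{i,t-1} [E(p_{it|T}) - E(p_{it|t-1})]$$

$$V(p_{i,t-1|T}) = V(p_{i,t-1|t-1}) + J_{i,t-1} [V(p_{it|T}) - V(p_{it|t-1})] J_{i,t-1}^T$$

and

$$\mathrm{Cov}(p_{i,t-1}, p_{it|T}) = J_{i,t-1} V(p_{it|T})$$

where $J_{i,t-1} = V(p_{i,t-1|t-1}) F^T V(p_{it|t-1})^{-}$ is the usual
fixed interval smoothing gain.
Each calculation in the backward step requires only output
from the previous step of the backward filter and the tth
step of the forward filter.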
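The backward pass can be sketched in the same style. The code assumes the standard fixed-interval smoothing gain $J_{t-1} = V(p_{t-1|t-1}) F^T V(p_{t|t-1})^{-}$ and takes the predicted and filtered moments as inputs; the function name `smooth` and the hand-computed scalar inputs (a random walk with unit variances, one observation equal to 2) are assumptions made for this illustration.

```python
import numpy as np


def smooth(pred_m, pred_V, filt_m, filt_V, F):
    """Fixed-interval smoothing for t = T, T-1, ..., 1.

    Lists are indexed t = 0..T. pred_*[t] holds the moments of p_t given
    data through t-1 (pred_*[0] is unused); filt_*[t] holds the moments
    given data through t. Returns smoothed means, smoothed covariances
    and lag-one covariances cross[t] = Cov(p_{t-1}, p_t | all data).
    """
    T = len(filt_m) - 1
    sm_m, sm_V = list(filt_m), list(filt_V)
    cross = [None] * (T + 1)
    for t in range(T, 0, -1):
        J = filt_V[t - 1] @ F.T @ np.linalg.pinv(pred_V[t])
        sm_m[t - 1] = filt_m[t - 1] + J @ (sm_m[t] - pred_m[t])
        sm_V[t - 1] = filt_V[t - 1] + J @ (sm_V[t] - pred_V[t]) @ J.T
        cross[t] = J @ sm_V[t]
    return sm_m, sm_V, cross


# Hand-computed filter output for a scalar random walk with T = 1.
F = np.array([[1.0]])
pred_m = [None, np.array([0.0])]
pred_V = [None, np.array([[2.0]])]
filt_m = [np.array([0.0]), np.array([4.0 / 3.0])]
filt_V = [np.array([[1.0]]), np.array([[2.0 / 3.0]])]

sm_m, sm_V, cross = smooth(pred_m, pred_V, filt_m, filt_V, F)
print(sm_m[0], sm_V[0], cross[1])
```

For this toy case the smoothed mean and variance at time 0 are both 2/3 and the lag-one covariance is 1/3, which matches direct conditioning in the joint Gaussian distribution.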

Application of standard Gaussian techniques in the M-
Step then gives maximum likelihood estimates. Again,
details may be found in Schmid (1994).

When values in the observed series are missing, the
corresponding values in the true series become unknown
even if they are not measured with error. Hence, any series
with missing values must be part of $p_{it}$. If the series has no
observational error, then the appropriate elements of $\Sigma_\Omega$
are set to zero. In the filter algorithms, both the missing
values in $p^*_{it}$ and the corresponding rows of $\Lambda$ and $\Lambda_0$ are
set to zero. This gives the proper estimates assuming that
the data are missing at random (Shumway and Stoffer,
1982). Because the state vector consists of conditionally
Gaussian random variables, missing data on discrete or
other non-Gaussian variables cannot be handled by these
algorithms. Further details of this EM algorithm may be
found in Schmid (1994).
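The zeroing device for missing values can be checked numerically. The sketch below assumes a two-component state observed directly, a diagonal observation error covariance, and no $\Phi d$ term (all simplifications made for illustration). It compares the filter update with the missing component zeroed out against an update that uses only the observed component.

```python
import numpy as np


def update(m_pred, V_pred, p_star, Lam, Sigma_obs):
    """One Kalman update step with a generalized inverse (Phi d omitted)."""
    S = Lam @ V_pred @ Lam.T + Sigma_obs
    K = V_pred @ Lam.T @ np.linalg.pinv(S)
    m = m_pred + K @ (p_star - Lam @ m_pred)
    V = V_pred - K @ Lam @ V_pred
    return m, V


# Illustrative predicted moments for a two-component state.
m_pred = np.array([1.0, -1.0])
V_pred = np.array([[2.0, 0.8],
                   [0.8, 1.0]])
Sigma_obs = np.diag([0.5, 0.3])

# Second observed component is missing: set its value and its row of
# the observation matrix to zero.
Lam_zeroed = np.array([[1.0, 0.0],
                       [0.0, 0.0]])
p_star_zeroed = np.array([1.7, 0.0])
m_zero, V_zero = update(m_pred, V_pred, p_star_zeroed, Lam_zeroed, Sigma_obs)

# Reference: drop the missing component from the observation equation.
Lam_obs = np.array([[1.0, 0.0]])
m_ref, V_ref = update(m_pred, V_pred, np.array([1.7]), Lam_obs,
                      np.array([[0.5]]))

print(m_zero, m_ref)
```

With a diagonal observation covariance the two updates agree, so the zeroed row contributes nothing while the correlation in the predicted covariance still passes information from the observed component to the unobserved one.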

4.  FITTING  BY  GIBBS  SAMPLING 

Gibbs sampling has become a popular tool for
numerically computing the posterior distribution in
Bayesian models. It works by sequentially drawing from the
conditional distribution of each random variable given the
latest drawn values from all the other random variables in
the model. In a problem with complete data $Y$ and
parameters $\theta$, the algorithm involves iteratively drawing
from the distribution of $Y \mid \theta$ and then from $\theta \mid Y$ until
convergence is achieved. The final drawn values will be
from the correct conditional distributions under some
general conditions (Geman and Geman, 1984). The correct
implementation of the Gibbs algorithm requires knowing:
1) the conditional distributions $Y \mid \theta$ and $\theta \mid Y$; 2) how to
draw from these distributions; and 3) how to assess
convergence. We shall not address the third point here but
refer the reader to the literature (e.g., Gelman and Rubin,
1992; Roberts, 1992).
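The alternating draws can be illustrated on a much simpler problem than the longitudinal model: a normal sample with unknown mean and variance under the standard noninformative prior $p(\mu, \sigma^2) \propto 1/\sigma^2$. This is a sketch of the general mechanism only. The function name `gibbs_normal`, the data and the run lengths are assumptions made for illustration, and no convergence assessment is attempted.

```python
import random
import statistics


def gibbs_normal(y, n_iter=4000, burn_in=1000, seed=0):
    """Two-block Gibbs sampler for (mu, sigma2) of a normal sample."""
    rng = random.Random(seed)
    n = len(y)
    ybar = statistics.fmean(y)
    sigma2 = statistics.variance(y)   # starting value
    draws = []
    for it in range(n_iter):
        # mu | sigma2, y ~ N(ybar, sigma2 / n) under a constant prior on mu.
        mu = rng.gauss(ybar, (sigma2 / n) ** 0.5)
        # sigma2 | mu, y is a scaled inverse chi-square: SS / chi2_n.
        ss = sum((yi - mu) ** 2 for yi in y)
        chi2 = rng.gammavariate(n / 2.0, 2.0)   # chi-square with n d.f.
        sigma2 = ss / chi2
        if it >= burn_in:
            draws.append((mu, sigma2))
    return draws


# Made-up data for illustration.
y = [4.8, 5.6, 5.1, 6.3, 4.2, 5.9, 5.0, 4.7, 6.1, 5.4, 5.2, 4.9]
draws = gibbs_normal(y)
mu_mean = statistics.fmean(mu for mu, _ in draws)
print(f"posterior mean of mu: {mu_mean:.3f}, sample mean: {statistics.fmean(y):.3f}")
```

The same pattern, a Gaussian draw for regression-type parameters followed by an inverse chi-square draw for a variance, is what the conditional distributions for the longitudinal model require, with the state vectors added as a further block.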

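In this longitudinal regression model,
$Y = \{p_{it}, p^*_{it}, x_{it}\}$ for all i, t and $\theta = \{\theta_y, \theta_{y0}, \theta_s, \theta_{s0}, \theta_z,
\sigma^2, \sigma_0^2, \Sigma_s, \Sigma_{s0}, \Sigma_z\}$. Using a suitable prior distribution
for $\theta$, Gibbs sampling then involves calculating the
distributions of (1) $p_{it} \mid p_{i,t-1}, p_{i,t+1}, p^*_{it}, x_{it}, x_{i,t+1}, \theta$, (2)
$p^*_{it} \mid p_{it}, x_{it}, \theta$ and (3) $\theta \mid p, p^*, x$ where $p$, $p^*$ and $x$ represent
the collection of all $p_{it}$, $p^*_{it}$ and $x_{it}$, respectively. Working
directly with the distributions of $p_{it}$ and $p^*_{it}$ is more
efficient than working with the distributions of the
individual components of $p_{it}$ that follow from equations
(1)-(5). By collecting all terms involving $p_{it}$ in the
likelihood, we can show the distribution of
$p_{it} \mid p_{i,t-1}, p_{i,t+1}, p^*_{it}, x_{it}, x_{i,t+1}, \theta$ to be $N(Bb, B)$ where

$$B = [\Sigma_0^{-1} + F^T \Sigma^{-1} F + \Lambda_0^T (\psi_0 \Sigma_{\Omega 0} \psi_0^T)^{-1} \Lambda_0]^{-1}$$

$$b = \Sigma_0^{-1} G_0 d_{i0} + F^T \Sigma^{-1} (p_{i1} - G d_{i1}) + \Lambda_0^T (\psi_0 \Sigma_{\Omega 0} \psi_0^T)^{-1} (p^*_{i0} - \Phi_0 d_{i0})$$

if t = 0;

$$B = [\Sigma^{-1} + F^T \Sigma^{-1} F + \Lambda^T (\psi \Sigma_\Omega \psi^T)^{-1} \Lambda]^{-1}$$

$$b = \Sigma^{-1} (F p_{i,t-1} + G d_{it}) + F^T \Sigma^{-1} (p_{i,t+1} - G d_{i,t+1}) + \Lambda^T (\psi \Sigma_\Omega \psi^T)^{-1} (p^*_{it} - \Phi d_{it})$$

if t = 1, 2, ..., T-1; and

$$B = [\Sigma^{-1} + \Lambda^T (\psi \Sigma_\Omega \psi^T)^{-1} \Lambda]^{-1}$$

$$b = \Sigma^{-1} (F p_{i,T-1} + G d_{iT}) + \Lambda^T (\psi \Sigma_\Omega \psi^T)^{-1} (p^*_{iT} - \Phi d_{iT})$$

if t = T.

Likewise, the distribution of $p^*_{i0} \mid p_{i0}, d_{i0}, \theta$ for missing
values follows a normal distribution with mean
$\Lambda_0 p_{i0} + \Phi_0 d_{i0}$ and variance $\Sigma_{\Omega 0}$, while that for
$p^*_{it} \mid p_{it}, x_{it}, \theta$ is normal with mean $\Lambda p_{it} + \Phi d_{it}$ and
variance $\Sigma_\Omega$.

Under the standard noninformative (constant) prior,

$$\theta_y \mid p, p^*, x, \sigma^2 \sim N[(H_y^T H_y)^{-1} (H_y^T y),\ \sigma^2 (H_y^T H_y)^{-1}]$$

$$\theta_{y0} \mid p, p^*, x, \sigma_0^2 \sim N[(H_{y0}^T H_{y0})^{-1} (H_{y0}^T y_0),\ \sigma_0^2 (H_{y0}^T H_{y0})^{-1}]$$

$$\theta_s \mid p, p^*, x, \Sigma_s \sim N[(H_s^T \Sigma_s^{-1} H_s)^{-1} (H_s^T \Sigma_s^{-1} s),\ (H_s^T \Sigma_s^{-1} H_s)^{-1}]$$

$$\theta_{s0} \mid p, p^*, x, \Sigma_{s0} \sim N[(H_{s0}^T \Sigma_{s0}^{-1} H_{s0})^{-1} (H_{s0}^T \Sigma_{s0}^{-1} s_0),\ (H_{s0}^T \Sigma_{s0}^{-1} H_{s0})^{-1}]$$

$$\theta_z \mid p, p^*, x, \Sigma_z \sim N[(H_z^T \Sigma_z^{-1} H_z)^{-1} (H_z^T \Sigma_z^{-1} z),\ (H_z^T \Sigma_z^{-1} H_z)^{-1}]$$

where $y$, $y_0$, $s$, $s_0$, $z$, $H_y$, $H_{y0}$, $H_s$, $H_{s0}$ and $H_z$ are
formed by stacking their respective elements.

The posterior distributions of $\sigma^2 \mid p, p^*, x$ and
$\sigma_0^2 \mid p, p^*, x$ are inverse chi-square distributions under the
standard noninformative prior for variances and those of
$\Sigma_s \mid p, p^*, x$, $\Sigma_{s0} \mid p, p^*, x$ and $\Sigma_z \mid p, p^*, x$ are inverse
Wishart distributions (Box and Tiao, 1973). The Gibbs
sampler then consists of repeated sequential draws from
these conditional Gaussian and Wishart distributions.

The flexibility of Gibbs sampling can facilitate
computation in extensions of this model. One extension
incorporates between-individual variability not captured by
the regression covariates by letting the regression intercepts
be random effects varying across individuals. Another uses
higher-order autoregressive and moving average terms in
an ARMA structure with trend given by covariates. Non-
Gaussian errors and non-linear terms are other possible
extensions (Carlin, Polson and Stoffer, 1992).

5. EXAMPLE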

Using EM, the model was successfully applied to fit
six years of pulmonary function measurements on 158
children in the Childhood Respiratory Disease Study
(Redline, Tager, Segal, Gold, Speizer, and Weiss 1989)
despite a substantial number of missing observations. The
response forced vital capacity (FVC), the greatest volume of
air a subject can forcefully expel in 6 seconds from total
lung expansion, was expressed as a function of a child's
age, sex, height and airways response to a cold air
challenge. A total of 38 percent of the airways response, 7
percent of the height and 12 percent of the FVC
measurements were missing. Details of the analysis may be
found in Schmid (1994) which also describes an analysis
adjusting for measurement error in FVC and airways
response that was measured externally (Redline, Tager,
Speizer, Rosner, and Weiss 1989).

Abstract

Residual maximum likelihood estimation (REML) is now
often the preferred method for estimating parameters
in linear models with correlated or heteroscedastic errors.
This note shows that the residual likelihood is a
conditional likelihood where the conditioning is on an
appropriate sufficient statistic to remove dependence on
nuisance parameters. This interpretation allows a very
concise derivation of the REML likelihood without the
need for transformation and generalizes naturally and
exactly to non-normal models in which there is a minimal
sufficient statistic for the fitted values. The conditional
interpretation of REML is applied to dispersion modelling
in generalized linear models. It is also applied to
estimate the index parameter in a power-variance family
of generalized linear models.

1  Introduction 

Consider the general linear model

$$y = X\beta + e$$

where $y$ is an $n \times 1$ vector of responses, $X$ is an $n \times p$
design matrix of full column rank and $e \sim N(0, \Omega)$ is
a random vector. The variance matrix $\Omega$ is a function
of a q-dimensional parameter $\gamma$, and is assumed positive
definite for $\gamma$ in a neighbourhood of the true value. For
any given value of $\gamma$, maximum likelihood or generalized
least squares lead to the estimator

$$\hat\beta = (X^T \Omega^{-1} X)^{-1} X^T \Omega^{-1} y$$

for $\beta$. The problem considered in this paper is the estimation
of $\gamma$.

Patterson and Thompson (1971) introduced residual
maximum likelihood estimation as a method of estimating
variance components in the case of unbalanced incomplete
block designs. The actual derivation of the
likelihood function was somewhat involved, and this
prompted Harville (1974), Cooper and Thompson (1977)
and Verbyla (1990) to give alternative derivations. In
all of these the residual likelihood is represented as the
marginal likelihood of the error contrasts. This makes
generalization of the residual likelihood principle to nonlinear
models or non-normal distributions difficult since
zero mean error contrasts do not generally exist. The
purpose of this note is to show that the residual likelihood
can be viewed also as a conditional likelihood where
the conditioning is on an appropriate sufficient statistic
to remove dependence on the nuisance parameters. This
interpretation may be of use in teaching because it clarifies
the motivation for residual maximum likelihood estimation
and because it allows a very concise derivation
of the REML likelihood without the need for transformation
of the data. It generalizes naturally and exactly
to non-normal models in which there exists a minimal
sufficient statistic for the fitted values.

The plan of this paper is as follows. Conditional likelihoods
are discussed briefly in Section 2. The conditional
derivation of REML is given in Section 3, and its
generalization to generalized linear models in Section 4.
Section 5 discusses dispersion estimation in generalized
linear models, including the case where the dispersion is
modelled using a link-linear model as in Smyth (1989).
Section 6 discusses the estimation of parameters in the
variance function, in a case where the exact likelihood
can be specified. Emphasis in Sections 5 and 6 is given
to the one-way experimental layout, since in this case
the conditional likelihood can be written down in closed
form. In other cases numerical evaluation or asymptotic
approximation is necessary, and methods to do this are
discussed also.

2  Conditional  Likelihood 

Consider an arbitrary likelihood function $L(y; \beta, \gamma)$
where $\beta$ is a vector of nuisance parameters. If there
exists a statistic $t(y; \gamma)$, possibly depending on $\gamma$, that
is sufficient for $\beta$ then the nuisance parameters can be
eliminated from the likelihood by conditioning on $t$. If
the maximum likelihood estimator of $\beta$ is a one-to-one
function of $t$, then it can be argued that there is no available
information in $t$ about $\gamma$ in the absence of knowledge
of $\beta$, i.e., the information in $t$ is entirely consumed
in estimating $\beta$. Therefore there should be no information
loss in the conditional approach. The parameter of
interest, $\gamma$, can be estimated by maximizing the conditional
log-likelihood

$$\ell_{y|t}(y; \gamma) = \ell_y(y; \beta, \gamma) - \ell_t(t; \beta, \gamma)$$

which does not depend on $\beta$.

The idea of conditioning to remove nuisance parameters
is an old one (Bartlett, 1936, 1937). Kalbfleisch
and Sprott (1970) give an extensive discussion including
the case in which $t$ depends on $\gamma$. General expressions
for approximate conditional likelihoods based on saddle
point approximations have been developed by Barndorff-
Nielsen (1983) and Cox and Reid (1987). A long chain
of related work is referenced in Cox and Reid (1987) and
McCullagh and Nelder (1989, Chapter 7). Specific application
to generalized linear models is made by Davison
(1988).

3  A  Conditional  Derivation 

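Let $y$ and $\hat\beta$ be as in Section 1. For any $\gamma$, $\hat\beta$
is complete and minimal sufficient for $\beta$ so we can
eliminate $\beta$ from the likelihood by conditioning on $\hat\beta$.
Since $\hat\beta \sim N(\beta, (X^T \Omega^{-1} X)^{-1})$, the conditional log-
likelihood is

$$
\begin{aligned}
\ell_{y|\hat\beta}(y; \gamma) &= \ell_y(y; \beta, \gamma) - \ell_{\hat\beta}(\hat\beta; \beta, \gamma) \\
&= -\tfrac{n}{2} \log(2\pi) - \tfrac12 \log|\Omega| - \tfrac12 (y - X\beta)^T \Omega^{-1} (y - X\beta) \\
&\quad + \tfrac{p}{2} \log(2\pi) - \tfrac12 \log|X^T \Omega^{-1} X| + \tfrac12 (\hat\beta - \beta)^T X^T \Omega^{-1} X (\hat\beta - \beta) \\
&= -\tfrac{n-p}{2} \log(2\pi) - \tfrac12 \log|\Omega| - \tfrac12 \log|X^T \Omega^{-1} X| - \tfrac12 y^T P y
\end{aligned}
$$

where $P = \Omega^{-1} - \Omega^{-1} X (X^T \Omega^{-1} X)^{-1} X^T \Omega^{-1}$. This differs
from the likelihood function given by Harville (1974)
and Cooper and Thompson (1977) only in that it lacks
the constant Jacobian term, $\tfrac12 \log|X^T X|$, since no
transformation of the data has been used.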
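The cancellation of $\beta$ in this derivation can be verified numerically. The sketch below uses made-up data and an arbitrary positive definite $\Omega$ (both are assumptions for illustration), evaluates the difference of the two Gaussian log densities at an arbitrary $\beta$, and compares it with the closed form involving $y^T P y$.

```python
import numpy as np


def loglik_normal(resid, cov):
    """Log density of N(0, cov) evaluated at resid."""
    k = len(resid)
    _, logdet = np.linalg.slogdet(cov)
    quad = resid @ np.linalg.solve(cov, resid)
    return -0.5 * (k * np.log(2 * np.pi) + logdet + quad)


def reml_conditional(y, X, Omega):
    """-(n-p)/2 log 2pi - 1/2 log|Omega| - 1/2 log|X'Omega^-1 X| - 1/2 y'Py."""
    n, p = X.shape
    Oi = np.linalg.inv(Omega)
    A = X.T @ Oi @ X
    P = Oi - Oi @ X @ np.linalg.solve(A, X.T @ Oi)
    return -0.5 * ((n - p) * np.log(2 * np.pi)
                   + np.linalg.slogdet(Omega)[1]
                   + np.linalg.slogdet(A)[1]
                   + y @ P @ y)


rng = np.random.default_rng(0)
n, p = 8, 2
X = np.column_stack([np.ones(n), np.arange(n, dtype=float)])
# Heteroscedastic plus equicorrelated covariance; positive definite.
Omega = np.diag(np.linspace(0.5, 2.0, n)) + 0.3
y = rng.normal(size=n)
beta = np.array([0.7, -0.2])   # arbitrary value; it cancels out

Oi = np.linalg.inv(Omega)
A = X.T @ Oi @ X
beta_hat = np.linalg.solve(A, X.T @ Oi @ y)

direct = (loglik_normal(y - X @ beta, Omega)
          - loglik_normal(beta_hat - beta, np.linalg.inv(A)))
closed = reml_conditional(y, X, Omega)
print(direct, closed)
```

Changing `beta` leaves `direct` unchanged, which is the sense in which conditioning on $\hat\beta$ removes the nuisance parameter.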

That the conditional likelihood is equivalent to the
marginal distribution of the error contrasts can be seen
by transforming $y$ to $\hat\beta$ and $y_2 = L^T y$ where $L$ is an
$n \times (n-p)$ matrix of full column rank satisfying $L^T X = 0$.
Conditionally, $\hat\beta$ is constant, so maximizing the conditional
likelihood of $y$ is equivalent to maximizing the
conditional likelihood of $y_2$. Furthermore, $y_2$ and $\hat\beta$ are
independent so the conditional distribution of $y_2$ is the
same as its marginal distribution.
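The equivalence with the error-contrast likelihood can also be checked directly. The sketch assumes an orthonormal choice of $L$ (so that $L^T L = I$), obtained here from a complete QR decomposition of $X$; for that choice the marginal log density of $L^T y$ should equal the conditional log-likelihood plus $\tfrac12 \log|X^T X|$. The data and covariance matrix are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 7, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Omega = np.diag(rng.uniform(0.5, 2.0, size=n))
y = rng.normal(size=n)

# Orthonormal basis L of the orthogonal complement of span(X):
# L'X = 0 and L'L = I (an assumption of this sketch).
Q_full, _ = np.linalg.qr(X, mode="complete")
L = Q_full[:, p:]

# Marginal log density of the error contrasts y2 = L'y ~ N(0, L'Omega L).
y2 = L.T @ y
cov2 = L.T @ Omega @ L
marginal = -0.5 * ((n - p) * np.log(2 * np.pi)
                   + np.linalg.slogdet(cov2)[1]
                   + y2 @ np.linalg.solve(cov2, y2))

# Conditional log-likelihood of y given beta-hat.
Oi = np.linalg.inv(Omega)
A = X.T @ Oi @ X
P = Oi - Oi @ X @ np.linalg.solve(A, X.T @ Oi)
conditional = -0.5 * ((n - p) * np.log(2 * np.pi)
                      + np.linalg.slogdet(Omega)[1]
                      + np.linalg.slogdet(A)[1]
                      + y @ P @ y)

# Constant Jacobian term that separates the two expressions.
jacobian = 0.5 * np.linalg.slogdet(X.T @ X)[1]
print(marginal, conditional + jacobian)
```

Because the difference does not involve $\Omega$, both versions lead to the same estimate of $\gamma$.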

In the above derivation, $\ell_y$ is decomposed as the sum
of a marginal and a conditional likelihood. Estimation
of $\gamma$ proceeds by maximizing the conditional and then $\beta$
is estimated by maximizing the marginal.

4  Generalized  Linear  Models 

The generalization of REML to generalized linear models
can now be stated. Consider the probability density
function defined by

$$f(y; \theta, \phi) = \exp[\{y\theta - \kappa(\theta)\}/\phi + c(y, \phi)]$$

For given values of $\phi$, this is a linear exponential family
density function. Following Jørgensen (1987), the
distribution defined by $f(y; \theta, \phi)$ is called an exponential
dispersion model with dispersion parameter $\phi$ and
is denoted $ED(\mu, \phi)$ where $\mu = E(y) = \dot\kappa(\theta)$. Let
$y_i \sim ED(\mu_i, \phi_i)$, $i = 1, \ldots, n$, be independent random
variables. A generalized linear model arises if a link-
linear model is assumed for the means, $g(\mu_i) = x_i^T \beta$
where $x_i$ is a vector of covariates, $\beta$ is an unknown p-
vector of regression parameters and $g()$ is a known link
function. We assume also that the dispersions depend
on an unknown parameter vector $\gamma$, for example through
a link-linear model $h(\phi_i) = z_i^T \gamma$ as in Smyth (1989),
where $z_i$ is a vector of covariates.

Let $\Phi = \mathrm{diag}(\phi_i)$ and $X$ be the $n \times p$ matrix with
$x_i^T$ as ith row. We assume $g()$ to be the canonical link
function such that $g(\mu_i) = \theta_i$ so that $t = X^T \Phi^{-1} y$ is a
complete sufficient statistic for $\beta$. We define the REML
estimate of $\gamma$ to be that which maximizes the conditional
likelihood of $y$ given $t$.

REML can also be used to estimate parameters in
the variance function of a generalized linear model if
the probability density can be completely specified. Let
$\lambda$ be a parameter vector which indexes a family of
exponential dispersion models, $ED_\lambda(\mu, \phi)$, and assume
$y_i \sim ED_\lambda(\mu_i, \phi_i)$ with $\mu_i$ and $\phi_i$ as given above. In general
the functions $\kappa()$, $c()$ and $g()$ will depend on $\lambda$, and
$\mathrm{var}(y_i) = \phi_i v(\mu_i, \lambda)$ where $v(\mu, \lambda) = \ddot\kappa(\theta)$. We define the
REML estimates of $\lambda$ and $\gamma$ to be those which maximize
the conditional likelihood of $y$ given $t$.

The  next  two  sections  of  this  paper  work  out  REML 
estimates  for  certain  generalized  linear  models  in  which 
the  conditional  likelihood  can  be  obtained  in  closed  form. 

5  Dispersion  Estimation 

5.1  The  one-way  layout 

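Consider a generalized linear model with means described
by a one-way classification, i.e., let $y_{ij}$, $i =
1, \ldots, b$, $j = 1, \ldots, n_i$, be independent random variables
with $y_{ij} \sim ED(\mu_i, \gamma)$. The group mean $\bar y_i$ is sufficient for
$\mu_i$ and is distributed as $ED(\mu_i, \gamma/n_i)$. The conditional
log-likelihood is

$$\ell_{y|\bar y} = \sum_{i=1}^{b} \left[ \sum_{j=1}^{n_i} c(y_{ij}, \gamma) - c(\bar y_i, \gamma/n_i) \right]$$

For example suppose the $y_{ij}$ are normally distributed.
In that case $c(y, \gamma) = -\tfrac12 \log\gamma - \tfrac{1}{2\gamma} y^2 - \tfrac12 \log 2\pi$ (McCullagh
and Nelder, 1989), so

$$\ell_{y|\bar y} = -\frac{N-b}{2} \log 2\pi\gamma - \frac{D(y)}{2\gamma} - \frac12 \sum_{i=1}^{b} \log n_i$$

where $N = \sum n_i$ and $D(y) = \sum (y_{ij} - \bar y_i)^2$. The conditional
maximum likelihood estimator is $\hat\gamma = D(y)/(N - b)$,
which is the usual residual mean square estimator of
the variance in one-way analysis of variance.

If the $y_{ij}$ are inverse-Gaussian, then $c(y, \gamma) =
-1/(2\gamma y) - \tfrac12 \log\gamma - \tfrac32 \log y - \tfrac12 \log 2\pi$. In that case

$$\ell_{y|\bar y} = -\frac{N-b}{2} \log 2\pi\gamma - \frac{D(y)}{2\gamma} - \frac32 \sum_{i=1}^{b} \left( \sum_{j=1}^{n_i} \log y_{ij} - \log \bar y_i \right) - \frac12 \sum_{i=1}^{b} \log n_i$$

where

$$D(y) = \sum_{i=1}^{b} \sum_{j=1}^{n_i} \left( \frac{1}{y_{ij}} - \frac{1}{\bar y_i} \right)$$

The REML estimator of $\gamma$ is the residual mean square
deviance, $\hat\gamma = D(y)/(N - b)$.

In both normal and inverse-Gaussian cases, the REML
estimator $\hat\gamma$ is uniform minimum variance unbiased for
$\gamma$, and $(N - b)\hat\gamma/\gamma \sim \chi^2_{N-b}$ independently of the $\bar y_i$.

For the gamma distribution we have $c(y, \gamma) =
\log(y/\gamma)/\gamma - \log y - \log\Gamma(1/\gamma)$ so

$$\ell_{y|\bar y} = \frac{1}{\gamma} \sum_{i=1}^{b} \sum_{j=1}^{n_i} \log\frac{y_{ij}}{n_i \bar y_i} - N \log\Gamma(1/\gamma) + \sum_{i=1}^{b} \log\Gamma(n_i/\gamma) - \sum_{i=1}^{b} \left( \sum_{j=1}^{n_i} \log y_{ij} - \log \bar y_i \right)$$

This is an exponential family likelihood with canonical
parameter $\nu = 1/\gamma$, sufficient statistic $D(y) =
\sum_{i=1}^{b} \sum_{j=1}^{n_i} \log\{y_{ij}/(n_i \bar y_i)\}$ and cumulant function $\Lambda(\nu) =
N \log\Gamma(\nu) - \sum_{i=1}^{b} \log\Gamma(n_i \nu)$. The REML estimator of
$\gamma$ is obtained by equating $D(y)$ to its expectation,

$$D(y) = \dot\Lambda(\nu) = N\psi(\nu) - \sum_{i=1}^{b} n_i \psi(n_i \nu)$$

where $\psi()$ is the digamma function. This can be compared
to maximum likelihood estimation of $\gamma$ which
would have $\log(n_i \nu)$ in place of $\psi(n_i \nu)$ in the last term.
Compare with Cox and Reid (1987, p. 12) and McCullagh
and Nelder (1989, p. 295).

5.2 Dispersion Modelling

Now consider the one-way layout with a link-linear
model for the dispersion, i.e., suppose that the $y_{ij} \sim
ED(\mu_i, \phi_{ij})$ and the $\phi_{ij}$ are a function of a q-vector of
parameters $\gamma$. The log-likelihood is

$$\ell_y = \sum_{i=1}^{b} \sum_{j=1}^{n_i} \left\{ \frac{1}{\phi_{ij}} [y_{ij}\theta_i - \kappa(\theta_i)] + c(y_{ij}, \phi_{ij}) \right\} = \sum_{i=1}^{b} \frac{1}{a_i} [t_i \theta_i - \kappa(\theta_i)] + \sum_{i=1}^{b} \sum_{j=1}^{n_i} c(y_{ij}, \phi_{ij})$$

where $a_i = (\sum_{j=1}^{n_i} \phi_{ij}^{-1})^{-1}$, $t_i = a_i \sum_{j=1}^{n_i} y_{ij}/\phi_{ij}$ and
$\mu_i = \dot\kappa(\theta_i)$. Each $t_i$ is sufficient for $\mu_i$ and is distributed
as $ED(\mu_i, a_i)$. The conditional log-likelihood of $y$ given
the $t_i$ is

$$\ell_{y|t} = \sum_{i=1}^{b} \sum_{j=1}^{n_i} c(y_{ij}, \phi_{ij}) - \sum_{i=1}^{b} c(t_i, a_i)$$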
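For the gamma one-way layout of Section 5.1, the REML and maximum likelihood estimating equations can be solved side by side. The sketch below uses only the standard library and assumes the equations $D(y) = N\psi(\nu) - \sum_i n_i \psi(n_i \nu)$ for REML and $D(y) = N\psi(\nu) - \sum_i n_i \log(n_i \nu)$ for maximum likelihood, with $D(y) = \sum_{ij} \log(y_{ij} / \sum_j y_{ij})$ and $\nu = 1/\gamma$. The data, the hand-written `digamma` approximation and the bisection helper `solve_increasing` are all assumptions made for illustration.

```python
import math


def digamma(x):
    """Digamma for x > 0 via upward recurrence and an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series


def solve_increasing(f, lo=1e-3, hi=1e6, iters=200):
    """Bisection on a log scale for the root of an increasing function."""
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)


# Made-up positive responses in b = 3 groups.
groups = [[2.1, 3.4, 2.8], [5.0, 7.2, 6.1, 4.4], [1.2, 0.9, 1.6]]
sizes = [len(g) for g in groups]
N = sum(sizes)

# Sufficient statistic: sum of log(observation / group total).
D = sum(math.log(y / sum(g)) for g in groups for y in g)


def reml_score(nu):
    return N * digamma(nu) - sum(n * digamma(n * nu) for n in sizes) - D


def ml_score(nu):
    return N * digamma(nu) - sum(n * math.log(n * nu) for n in sizes) - D


nu_reml = solve_increasing(reml_score)
nu_ml = solve_increasing(ml_score)
gamma_reml, gamma_ml = 1.0 / nu_reml, 1.0 / nu_ml
print(f"REML dispersion {gamma_reml:.4f}, ML dispersion {gamma_ml:.4f}")
```

The REML dispersion estimate comes out larger than the maximum likelihood one, in the same direction as the familiar N - b versus N divisor in the normal case.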

5.3  General  Mean  Models 

We now leave the one-way layout and consider general
link-linear models for the $\mu_i$. Suppose that
$y_i \sim \mathrm{ED}(\mu_i, \phi_i)$, $i = 1, \ldots, n$, with link-linear models for both
$\mu_i$ and $\phi_i$ as described in Section 3. The sufficient statistic
for $\beta$ is $t =$ and this has cumulant function

*=i 

where $\kappa(\cdot)$ is the cumulant function of the $y_i$. The cumulant
generating function of $t$ is $K(s) = \kappa_t(\beta + s) - \kappa_t(\beta)$,
so the probability density function of $t$ is given by

/(<)  =  /  (e  -  .’■tj 

The required conditional log-likelihood is

which does not depend on $\beta$. Except in the normal case,
the cumulant generating function of $t$ is difficult to invert
analytically, so either numerical evaluation or approximation
will generally be necessary.

One possible approximation is to use, following a suggestion
of A. T. James (James and Wiskich, 1993), the
asymptotic normal approximation to the distribution
of $t$. This leads to the approximate conditional
log-likelihood

$$= \ell_y(y;\beta,\gamma) + \frac{p}{2}\log 2\pi - \frac{1}{2}\log|X^TWX| + \frac{1}{2}(\hat\beta-\beta)^TX^TWX(\hat\beta-\beta)$$

where $W = \mathrm{diag}\{\phi_i^{-1}v(\mu_i)\}$ and $v(\cdot)$ is the variance function
defined by $v(\mu) = \ddot\kappa(\theta)$. This expression depends on
$\beta$, but only slightly, so we can set $\beta = \hat\beta$, yielding the
approximation

$$\ell_y(y;\hat\beta,\gamma) + \frac{p}{2}\log 2\pi - \frac{1}{2}\log|X^TWX| \qquad (1)$$

i.e., the log-profile likelihood for $\gamma$ adjusted by the
log-determinant of the covariance matrix of $\hat\beta$. This method
is applicable even when the link function $g(\cdot)$ is not
canonical, although then $t$ is not sufficient so it is
impossible to entirely eliminate $\beta$ from the estimation of
$\gamma$.

Another approach which leads to the same approximation
in this case is to use the modified profile likelihood
of Barndorff-Nielsen (1983) together with a suggestion
of Cox and Reid (1987) for orthogonal parameters. The
modified profile likelihood for $\gamma$ is

$$\ell_y(y;\hat\beta_\gamma,\gamma) - \frac{1}{2}\log|j_{\beta\beta}| + \log\left|\frac{\partial\hat\beta}{\partial\hat\beta_\gamma}\right|$$

where $\hat\beta_\gamma$ is the maximum likelihood estimator for $\beta$ for
given $\gamma$, $\hat\beta$ is the unrestricted maximum likelihood
estimator, $j_{\beta\beta}$ is the observed information matrix for $\beta$
evaluated at $\hat\beta_\gamma$ and $\ell_y(y;\hat\beta_\gamma,\gamma)$ is the log-profile
likelihood for $\gamma$. Since $\beta$ and $\gamma$ are orthogonal, $\hat\beta_\gamma$ varies
only slowly with $\gamma$ so the derivative term can
be neglected. For the current model we have

$$j_{\beta\beta} = X^TWX$$

and the modified profile likelihood is, apart from constants,
the same as (1).

For normal linear models, the approximate conditional
likelihood (1) is precisely the same as the standard residual
likelihood given in Section 3. When the $y_i$ are
inverse-Gaussian and $\gamma$ is scalar, modified profile
likelihood leads to the residual mean deviance as the
estimator of the dispersion. In other cases, the effectiveness
of the approximation needs to be evaluated. This is not
done here as our primary intention is to clarify the exact
conditional approach.


Table 1: Simulation results for estimating $\gamma$ and $\phi$. One
thousand data sets were generated. True values are $\gamma = 1.5$
and $\phi = 1.0$.

(a) Estimation of $\gamma$

                        Mean      Std       MSE
Maximum likelihood      1.4731    0.0711    0.0058
REML                    1.4873    0.0769    0.0061
Extended Quasi-Lik.     1.2345    0.0961    0.0798
Pseudo-Likelihood       1.5494    0.1894    0.0383

(b) Estimation of $\phi$

                        Mean      Std       MSE
Maximum likelihood      0.9010    0.1809    0.0425
REML                    0.9915    0.2048    0.0420
Extended Quasi-Lik.     1.0008    0.2057    0.0423
Pseudo-Likelihood       0.9015    0.1904    0.0460

6  Variance  Function  Estimation 


Suppose that $\gamma$ is an unknown parameter that indexes
a family of generalized linear models. That is, suppose
that $y_i \sim \mathrm{ED}_\gamma(\mu_i, \phi)$, $i = 1, \ldots, n$, where $g(\mu_i) = x_i^T\beta$
and $\mathrm{var}(y_i) = \phi v(\mu_i, \gamma)$. The REML estimators of $\gamma$
and $\phi$ are those which maximize the conditional
likelihood of $y$ given $X^Ty$. The purpose of this section
is to consider a potentially important example, that of
the compound Poisson exponential dispersion models
introduced by Jørgensen (1987). The compound Poisson
models have power variance functions $v(\mu, \gamma) = \mu^\gamma$ with
$\gamma$ between one and two. The compound Poisson
distributions converge to Poisson as $\gamma \to 1$ and to gamma as
$\gamma \to 2$, and so may be viewed as intermediate between
the Poisson and gamma families. They are also positive
and continuous except for mass at zero. Compound
Poisson generalized linear models have potential
applications in modelling continuous data with exact zeros,
such as weather variables, insurance claims and waiting
times, but the problem of estimating $\gamma$ has not been
satisfactorily solved (Burridge, 1987; Gilchrist, 1987).

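The compound Poisson density function has been derived
by Jørgensen (1992). See also Tweedie (1984). It
has $\theta = \mu^{1-\gamma}/(1-\gamma)$, $\kappa(\theta) = \mu^{2-\gamma}/(2-\gamma)$ and

$$c(y,\phi) = \log\sum_{i=1}^{\infty}\frac{\{\alpha^{-1}(\alpha+1)^{\alpha+1}\phi^{-\alpha-1}y^{\alpha}\}^{i}}{i!\,\Gamma(i\alpha)} - \log y$$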


where $\alpha = (2 - \gamma)/(\gamma - 1)$. Tweedie (1984, p. 586) has
identified $\exp c(y, \phi)$ as an instance of Wright's (1933)
generalized Bessel function. It is not, however, expressible
in terms of the more common Bessel functions.

A simulation experiment was conducted to compare
four estimators of $\phi$ and $\gamma$. These were maximum
likelihood estimation, REML, extended quasi-likelihood
(Nelder and Pregibon, 1987) and pseudo-likelihood
(Davidian and Carroll, 1987). Data were simulated from
a one-way classification with $n_1 = \ldots = n_5 = 10$,
$\beta = (0.1, 0.5, 1, 2, 5)^T$, $\phi = 1$ and $\gamma = 1.5$. One thousand
such data sets were generated and, for each, $\gamma$ and
$\phi$ were estimated using the four methods. The results
are tabulated in Table 1.
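A minimal sketch of how one such data set could be generated is given below. It relies on the standard representation of a compound Poisson variable with power variance $\phi\mu^\gamma$, $1 < \gamma < 2$, as a Poisson number of gamma summands. The function names are hypothetical, and treating the entries of $\beta$ directly as group means is an assumption, since the link used in the experiment is not stated here.

```python
import math
import random


def poisson_draw(lam, rng):
    """Knuth's multiplication method; adequate for the small rates used here."""
    limit = math.exp(-lam)
    count, prod = 0, rng.random()
    while prod > limit:
        count += 1
        prod *= rng.random()
    return count


def compound_poisson_draw(mu, phi, gamma, rng):
    """One draw with mean mu and variance phi * mu**gamma, for 1 < gamma < 2."""
    lam = mu ** (2.0 - gamma) / (phi * (2.0 - gamma))    # Poisson rate
    shape = (2.0 - gamma) / (gamma - 1.0)                # gamma shape (alpha)
    scale = phi * (gamma - 1.0) * mu ** (gamma - 1.0)    # gamma scale
    n = poisson_draw(lam, rng)
    # An empty sum gives an exact zero, the point mass of the distribution.
    return sum((rng.gammavariate(shape, scale) for _ in range(n)), 0.0)


def simulate_one_way(means, n_per_group, phi, gamma, seed=0):
    """Simulate a one-way layout; `means` are assumed to be the group means."""
    rng = random.Random(seed)
    return [[compound_poisson_draw(mu, phi, gamma, rng)
             for _ in range(n_per_group)] for mu in means]


data = simulate_one_way([0.1, 0.5, 1.0, 2.0, 5.0], 10, phi=1.0, gamma=1.5)
```

The rate, shape and scale are chosen so that the mean is $\mu$ and the variance is $\phi\mu^\gamma$; the probability of an exact zero is $\exp(-\lambda)$ with $\lambda$ the Poisson rate.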

REML had the smallest bias for estimating $\gamma$. Maximum
likelihood had the smallest standard deviation, and
also the smallest mean square error, although this was
not significantly different from that of REML.
Pseudo-likelihood was also approximately unbiased, but with a
larger standard deviation. Extended quasi-likelihood
had a competitive standard deviation, but was biased
down, giving it the largest mean square error.
Experimentation showed that the bias was due to the offset of
1/6 for zero observations. Positive and negative biases
could be achieved by relatively small changes to this
offset.

REML and extended quasi-likelihood were almost
equally effective for estimating $\phi$. The maximum
likelihood estimator again had the smallest standard
deviation and a mean square error not significantly greater
than REML and extended quasi-likelihood, but was
biased down by about 10%, as expected given the group
size of ten. The pseudo-likelihood estimator was also
biased down by about the same amount, despite
incorporating a correction for degrees of freedom as
recommended by Davidian and Carroll (1987).

We conclude that REML, in its conditional likelihood
guise, is successful in reducing the bias of the maximum
likelihood estimator while incurring minimal inflation
of its standard deviation. Neither of its competitors,
extended quasi-likelihood and pseudo-likelihood, was as
successful in doing this.
Abstract

Bayesian statistical applications often present high
dimensional integration problems that require Monte
Carlo integration. Simple Monte Carlo, in contrast to
fixed integration rules, does not exploit the smoothness
that one would expect from a posterior distribution.
Two techniques are used to construct hybrid random
multidimensional integration rules. First, random
orthogonal transformations are used to reduce the
integration to one dimension. Then, random integration
rules are derived for infinite integration intervals,
generalizing rules developed by Siegel and O'Brien (1983)
for finite intervals. These new rules are constructed for
both Normal and Student-t weight functions. The
combined methods produce random rules for
multidimensional integrals over infinite regions with Normal or
Student-t weights. Example results are presented to
illustrate the effectiveness of the new rules for estimating
integrals that arise in Bayesian statistical computation.


1  Introduction 


A standard problem in Bayesian analysis is to numerically
compute integrals in the form

$$\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} g(\theta)p(\theta)\,d\theta_m\,d\theta_{m-1}\cdots d\theta_1,$$

with $\theta = (\theta_1, \theta_2, \ldots, \theta_m)^t$. The function $p(\theta)$ is an
unnormalised posterior density function and $g(\theta)$ is some
function for which an approximate expected value is
needed. We will assume that the posterior density
$p(\theta)$ is unimodal and approximately multivariate normal
($\theta \sim N_m(\mu, \Sigma)$) or multivariate Student-t
($\theta \sim t_m(\nu, \mu, \Sigma)$). Usually, expectations for several $g(\theta)$'s are
needed, and a typical practical calculation might use a
vector $g(\theta) = (1, \theta, \theta\theta^t)$, so that a normalizing constant
and the approximate mean and covariance matrix for $\theta$
could be determined.

This type of integration problem has traditionally
been handled using Monte-Carlo algorithms. The
simplest forms of these algorithms have low accuracy and
slow convergence, so a number of refinements have been
proposed (see the book by Davis and Rabinowitz, 1984,
and the more recent paper by Evans and Swartz, 1992).
One strategy that is usually effective for Monte-Carlo
error reduction is importance sampling. With this
strategy, $p(\theta)$ is approximated by some function $h(\theta)$, which
is relatively easy to sample from. The original integral
is then approximated by

$$\frac{1}{N}\sum_{i=1}^{N}\frac{g(\theta_i)p(\theta_i)}{h(\theta_i)},$$

where sample points $\{\theta_i\}$ are drawn randomly with
density $h(\theta)$. The standard error from the sample provides
a robust error estimate for the integral. If $h(\theta)$ is a good
approximation to $p(\theta)$, then the sample variance is
significantly reduced (along with the error), compared to
Monte-Carlo without importance sampling.
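The importance sampling estimate and its standard error can be sketched as follows. The function names, the one-dimensional setting and the standard normal choice for $h$ are assumptions made for illustration; only the averaging of $g(\theta_i)p(\theta_i)/h(\theta_i)$ over draws from $h$ comes from the text.

```python
import math
import random


def importance_sampling(g, p, h_pdf, h_draw, n, seed=0):
    """Estimate the integral of g(theta) * p(theta) using draws from density h.

    Returns (estimate, standard_error).
    """
    rng = random.Random(seed)
    values = []
    for _ in range(n):
        theta = h_draw(rng)
        values.append(g(theta) * p(theta) / h_pdf(theta))
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(var / n)


def normal_pdf(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def normal_draw(rng):
    return rng.gauss(0.0, 1.0)


def p_unnormalised(t):
    # Toy unnormalised posterior: a standard normal kernel without its constant.
    return math.exp(-0.5 * t * t)


# Normalizing constant: the importance-modified integrand is constant here.
const, const_se = importance_sampling(lambda t: 1.0, p_unnormalised,
                                      normal_pdf, normal_draw, 1000)
# Unnormalised second moment: the modified integrand is no longer constant.
second, second_se = importance_sampling(lambda t: t * t, p_unnormalised,
                                        normal_pdf, normal_draw, 1000)
```

In this toy case the first estimate has essentially zero standard error because $g\,p/h$ is constant, while the second carries ordinary Monte-Carlo error.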

Our new method can be considered a refinement of
Monte-Carlo with importance sampling, but it should be
better than simple Monte-Carlo with importance
sampling because the resulting integration rule will give the
exact result whenever the importance modified integrand
$g(\theta)(p(\theta)/h(\theta))$ is a cubic polynomial. Simple
Monte-Carlo with importance sampling results are exact
whenever the importance modified integrand is constant, so
the new method is expected to be significantly more
accurate than simple importance sampled Monte-Carlo
whenever the importance modified integrand is not
constant, but still has a reasonably accurate low degree
polynomial approximation.

Our method uses a multivariate normal or a multivariate
Student-t approximation to $p(\theta)$. For these
approximations, we assume that a standardizing transformation
in the form $\theta = \mu + Cx$ has been determined for our
problem, using numerical optimization if necessary. We
have used $\mu$ to denote the point where $\log(p(\theta))$ is
maximized, $\Sigma$ to denote the inverse of the negative of the
Hessian matrix for $\log(p(\theta))$ at $\mu$, and $C$ to denote the
lower triangular Cholesky factor for $\Sigma$ ($\Sigma = CC^t$). Then
the transformed integrals that we consider take the form

$$I(f) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} w(\|x\|)f(x)\,dx_m\,dx_{m-1}\cdots dx_1,$$


to develop a method for computing multivariate normal
probabilities. We still need random degree three
rules for the radial integrals $\int_0^\infty w(r)r^{m-1}h(r)\,dr$
for each of the two weight functions. We discuss these
degree three radial rules in the next section. We then
show how the radial rules can be combined with the
spherical surface rules to produce random rules for $I(f)$.
In Section 3, we demonstrate the use of the new
rules with two test Bayesian computation problems.

2  Random  Radial  Rules 


where $f(x) = g(\mu + Cx)p(\mu + Cx)/w(\|x\|)$, and
$w(\|x\|) \propto e^{-\|x\|^2/2}$ or $w(\|x\|) \propto (1 + \|x\|^2/\nu)^{-(\nu+m)/2}$.
If our approximation to the posterior density is a good
one, then we expect $f(x)$ to be well approximated by a
low degree polynomial in $x$, and this motivates our
construction of random multidimensional integration rules for
polynomials. These rules are generalizations of the
degree three rules derived for the interval $[-1,1]$, with weight
$w(r) = 1$, developed by Siegel and O'Brien (1983).
Earlier work by Hammersley and Handscomb (1964) also
considered the construction of random integration rules
for finite intervals.

Our development of the random multidimensional
integration rules requires an additional change of variables
to a radial-spherical coordinate system. We let $x = rz$,
with $z^tz = 1$, so that $x^tx = r^2$, for $r \in [0, \infty)$. Then

$$I(f) = \int_0^\infty\int_{z^tz=1} w(r)r^{m-1}f(rz)\,dz\,dr = \int_{-\infty}^{\infty}\int_{z^tz=1} w(|r|)|r|^{m-1}f(rz)\,dz\,dr/2.$$

We define a basic radial integration rule $R_\rho(h)$ by

$$R_\rho(h) = h(0) + \frac{a}{2\rho^2}\big(h(\rho) + h(-\rho) - 2h(0)\big) \approx \int_{-\infty}^{\infty}|r|^{m-1}w(r)h(r)\,dr,$$

where $\rho$ is a positive real number, $w(r)$ is now
normalized so that $\int_{-\infty}^{\infty}|r|^{m-1}w(r)\,dr = 1$, and
$a = \int_{-\infty}^{\infty}|r|^{m+1}w(r)\,dr$. We can prove (see Genz and
Monahan, 1994) the following theorem, which establishes two
important properties of the rules $R_\rho(h)$.

Theorem 1 If $\rho$ is a random variable on $(0, \infty)$, with
density $\frac{2}{a}r^{m+1}w(r)$, then

$$R_\rho(h) = \int_{-\infty}^{\infty}|r|^{m-1}w(r)h(r)\,dr,$$

whenever $h$ is a cubic polynomial, and

$$E\{R_\rho(h)\} = \int_{-\infty}^{\infty}|r|^{m-1}w(r)h(r)\,dr,$$

for any integrable $h$.
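The exactness property for cubics can be checked numerically. The sketch below writes the radial rule as $R_\rho(h) = h(0) + a\,(h(\rho) + h(-\rho) - 2h(0))/(2\rho^2)$, which is the form implied by steps 3(d) and 3(e) of the algorithm given later, and assumes the normal weight, for which $a = m$. The function name and the test polynomial are hypothetical.

```python
def radial_rule(h, rho, a):
    """Degree three radial rule: h(0) + a * (h(rho) + h(-rho) - 2 h(0)) / (2 rho^2)."""
    h0 = h(0.0)
    return h0 + a * (h(rho) + h(-rho) - 2.0 * h0) / (2.0 * rho ** 2)


def cubic(r):
    return 2.0 + r - 3.0 * r ** 2 + 5.0 * r ** 3


m = 4                       # dimension; a = m is assumed (normal weight)
a = float(m)
# Odd terms integrate to zero, the constant integrates to itself and
# r^2 integrates to a, so the exact value is 2 - 3 * a.
exact = 2.0 - 3.0 * a
values = [radial_rule(cubic, rho, a) for rho in (0.3, 1.7, 5.0)]
```

Every entry of `values` agrees with `exact` up to rounding whatever the value of `rho`, which is why a random `rho` can be used without losing exactness for cubics.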


We want to compute numerical approximations to $I(f)$,
so we need integration rules for the surface of the unit
m-sphere defined by $z^tz = 1$, and for the radial interval
$(-\infty, \infty)$. For the spherical surface integrals we use

$$S_Q(s) = \frac{1}{2m}\sum_{j=1}^{m}\big(s(-Qe_j) + s(+Qe_j)\big) \approx \int_{z^tz=1} s(z)\,dz,$$

where $e_j = (0, \ldots, 0, 1, 0, \ldots, 0)^t$ and $Q$ is an $m \times m$
random orthogonal matrix. The integration rule $S_Q(s)$ is
a degree three rule (see Stroud, 1971, p. 294) for the
surface of the unit m-sphere. If $Q$ is chosen uniformly
(see Stewart, 1980), $S_Q$ is an unbiased random degree
three rule for the surface of the unit m-sphere. Deak
(1990) uses a transformation to a similar spherical
coordinate system with random orthogonal transformations

For the two specific weight functions that we are
interested in, we have determined that $a = m$ when
$w(r) \propto e^{-r^2/2}$ and $a = m\nu/(\nu - 2)$ when $w(r) \propto (1 + r^2/\nu)^{-(\nu+m)/2}$.
A random degree three radial rule $R_\rho$ can now be
combined with a random degree three spherical rule $S_Q$ to
produce a random degree three rule for $I(f)$. We first
let $D_\rho(f, x) = (f(\rho x) + f(-\rho x) - 2f(0))/(2\rho^2)$. Then
our combined random spherical radial integration rule
for $I(f)$ is given by

$$SR_{Q,\rho}(f) = f(0) + \frac{a}{m}\sum_{j=1}^{m}D_\rho(f, Qe_j).$$

It is easy to establish the following (Genz and Monahan,
1994)

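Theorem 2 If $\rho$ is a random variable on $(0, \infty)$ with
density $\frac{2}{a}r^{m+1}w(r)$ and $Q$ is an $m \times m$ uniform random
orthogonal matrix, then

$$SR_{Q,\rho}(f) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} w(\|x\|)f(x)\,dx$$

whenever $f$ is a cubic polynomial, and

$$E\{SR_{Q,\rho}(f)\} = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} w(\|x\|)f(x)\,dx$$

for any integrable $f$.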

The unbiased degree three rules $SR_{Q,\rho}(f)$ form the
basis for the following algorithm:

Spherical-Radial Rule Integration Algorithm

1. Input $\epsilon$, $m$, $f$, $w$ and $N_{max}$.

2. Set $N = 0$, $I = 0$, $V = 0$ and compute $a$ and $f(0)$.

3. Repeat

(a) Set $SR = 0$.

(b) Generate a uniformly random orthogonal $m \times m$ matrix $Q$.

(c) Generate $\rho$ from the density $\frac{2}{a}r^{m+1}w(r)$.

(d) For $j = 1, 2, \ldots, m$ set
$SR = SR + (f(\rho Qe_j) + f(-\rho Qe_j) - 2f(0))/(2\rho^2)$.

(e) Set $SR = f(0) + aSR/m$, $N = N + 1$,
$D = (SR - I)/N$, $I = I + D$ and
$V = V + (N - 1)ND^2$.

Until $\sqrt{V/(N(N - 1))} < \epsilon$ or $N = N_{max}$.

4. Output $I \approx I(f)$, $\sigma = \sqrt{V/(N(N - 1))}$ and $N$.

The input $\epsilon$ is an error tolerance, the input $N_{max}$
provides a limit on the time for the algorithm, and the
output $\sigma$ is the standard error for the integral estimate $I$.
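The algorithm can be sketched in a few lines for the normal weight. Everything specific here is an assumption made for illustration: the weight is taken to be the normalized standard normal density (so that $a = m$ and $\rho^2$ follows a chi-square distribution with $m + 2$ degrees of freedom), the uniform random orthogonal matrix is built by Gram-Schmidt on Gaussian vectors rather than by the method of Stewart (1980), and the function names are hypothetical.

```python
import math
import random


def random_orthogonal(m, rng):
    """Columns of a uniformly distributed orthogonal matrix (Gram-Schmidt)."""
    cols = []
    for _ in range(m):
        v = [rng.gauss(0.0, 1.0) for _ in range(m)]
        for u in cols:                       # remove components along earlier columns
            dot = sum(a * b for a, b in zip(u, v))
            v = [b - dot * a for a, b in zip(u, v)]
        norm = math.sqrt(sum(b * b for b in v))
        cols.append([b / norm for b in v])
    return cols


def sr_sample(f, m, rng):
    """One spherical-radial rule value, steps 3(a)-(e), normal weight (a = m)."""
    q = random_orthogonal(m, rng)
    rho = math.sqrt(rng.gammavariate((m + 2) / 2.0, 2.0))  # rho^2 ~ chi-square(m+2)
    f0 = f([0.0] * m)
    total = 0.0
    for col in q:                            # col plays the role of Q e_j
        plus = [rho * c for c in col]
        minus = [-rho * c for c in col]
        total += (f(plus) + f(minus) - 2.0 * f0) / (2.0 * rho ** 2)
    a = float(m)
    return f0 + a * total / m


def sr_integrate(f, m, eps, n_max, seed=0):
    """Average SR rule values until the standard error is below eps or n_max is hit."""
    rng = random.Random(seed)
    n, estimate, v = 0, 0.0, 0.0
    while n < n_max:
        sr = sr_sample(f, m, rng)
        n += 1
        d = (sr - estimate) / n
        estimate += d
        v += (n - 1) * n * d * d             # running sum of squared deviations
        if n > 1 and math.sqrt(v / (n * (n - 1))) < eps:
            break
    sigma = math.sqrt(v / (n * (n - 1))) if n > 1 else float("inf")
    return estimate, sigma, n
```

For example, `sr_integrate(lambda x: math.cos(x[0]), 3, 1e-3, 5000)` estimates the expectation of $\cos(x_1)$ under a three-dimensional standard normal distribution, whose exact value is $e^{-1/2}$. For a cubic integrand every single call to `sr_sample` already returns the exact value.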

3  Examples 

The first example is a three dimensional nonlinear regression
problem from the time series book by Fuller (1976).
The posterior is given by

p(e)  =  (10  +  -di- 

t=l 


with $\theta \in (-\infty, \infty)^3$. We model $p(\theta)$ with a multivariate
normal approximation, so we use

$$f(x) = Ke^{x^tx/2}p(\mu + C(x_1, x_2, x_3)^t),$$

after  computing  the  mode  p  and  C  for  log(p).  The  con¬ 
stant  K  is  chosen  to  prevent  underflow  in  the  numer¬ 
ical  evaluation  of  /.  For  this  example  K  =  e®®.  In 
the  following  table  we  show  results  from  the  SR  rules. 
For comparison, we also show results from simple
importance sampled Monte-Carlo rules, where the
components for the sample points $x$ are randomly drawn from
$N(0, 1)$. The entries in the error columns are the
standard errors obtained from the random samples for the
respective methods.


Test Results with 10,000 $f$ Values

                    Simple M-C              SR Rules
f                   E{f}        Error       E{f}        Error
$p/w$               0.5773      0.0403      0.5277      0.0103
$\theta_1 p/w$      140.6531    9.5358      141.1459    2.7229
$\theta_2 p/w$      -83.7048    6.0097      -83.5633    1.6003
$\theta_3 p/w$      1.4968      0.1487      1.4242      0.0334

The standard errors for the simple Monte-Carlo rules are
3-4 times larger than those for SR rules. This indicates
that approximately ten times more computer time would
be needed for the crude Monte-Carlo rules to achieve an
accuracy level similar to that achieved by the SR rules.

For our second example we use a seven dimensional
proportional hazards model problem discussed by
Dellaportas and Wright (1992) and Lawless (1982). The
posterior is given by

i=l  <=1 

with $\rho > 0$ and $\beta \in (-\infty, \infty)^6$. After we first transform
$\rho$ using $x_1 = \log(\rho)$, we model $p(\theta)$ with a multivariate
normal approximation. So we use

f{x)  =  +  C(e*S  «a, xrY), 

after  computing  the  mode  p  and  C  for  log(p)  +  log(p). 
For  this  example  K  =  In  the  following  table  we 

show  results  for  the  SR  rules  and  simple  importance 
sampled  Monte-Carlo  rules. 
Test Results with 75,000 $f$ Values


Simple  M-C 

SR  Rules 

/ 

MMSM 

E{f} 

Error 

pIw 

0.3918 

0.0026 

0.3907 

0.0015 

pp/w 

1.1747 

0.0086 

HFCTa 

PipIw 

.4.06li 

0.0319 

/32pIw 

1.9680 

0.0194 

0.0065 

fiaplw 

-0.1217 

0.0008 

gnreini 

0.0005 

PapIv} 

0.0002 

-0.0193 

PbpIw 

-0.0418 

0.0033 

-0.0418 

Peplw 

0.1251 

0.0013 

0.1251 

0.0004 

For  this  example  the  SR  rule  results  have  standard  er¬ 
rors  that  are  approximately  half  as  large  as  those  for 
the  simple  Monte-Carlo.  These  results  are  not  as  good 
as  those  for  the  three  dimensional  problem,  but  the  SR 
rules  are  still  approximately  four  times  more  efficient 
than  the  simple  Monte-Carlo  rules. 

In order to monitor, and possibly improve, the
convergence of the SR rules, we have considered the
development of convergence diagnostics for the simple SR
integration algorithm described in the previous section.
The integrand $f(x)$ for a given integration problem could
have more (or less) variation around the spherical
surface than it does along radial directions. If we had
diagnostics to determine these differences in variation, our
simple algorithm could be modified to increase sampling
in either the spherical or radial directions, in order to
adapt to these differences.

A natural diagnostic for the spherical variation for a
fixed radius $\rho$ is the sample variance for the SR average
that is accumulated in the loop at step 3(d) in the
algorithm. Alternatively, with $Q$ fixed, a loop could be
introduced at step 3(c) so that several different $\rho$'s could
be used and the variance in the resulting SR rules could
be used as a diagnostic for radial variation. A relatively
large variation in the radial direction might indicate that
a multivariate normal model was not valid. Therefore a
multivariate Student-t model might be more appropriate,
and/or the number of samples in the radial direction
could be increased. Alternatively, a relatively large
variance for the SR average would suggest that more $Q$'s
should be used for each $\rho$. This could be accomplished
by interchanging steps 3(b) and 3(c) and adding a loop
at the modified step 3(c) that would allow several
different $Q$'s to be generated for each $\rho$. A more general
algorithm could have nested loops at both steps 3(b) and
3(c), with lengths dynamically adjusted to balance the
radial and spherical variances.


4  Concluding  Remarks 

We  have  described  degree  three  random  integration  rules 
that  can  be  used  to  numerically  estimate  integrals  over 
infinite  regions.  Results  from  two  examples  suggest  that 
averages  of  samples  of  these  rules  can  provide  more  ac¬ 
curate  integral  estimates  than  simpler  Monte-Carlo  im¬ 
portance  sampling  methods.  However,  in  contrast  to 
traditional  polynomial  rules  for  numerical  integration, 
the  standard  errors  from  the  random  rule  samples  can 
be  used  for  robust  error  estimation. 

For  future  work  with  these  rules  we  intend  to  consider 
more  examples,  and  develop  and  implement  heuristics  to 
automate  the  incorporation  of  the  variance  diagnostics 
into  our  algorithm.  We  also  hope  to  extend  our  work  to 
include  random  degree  five  rules  for  infinite  regions. 
Abstract

PVM is software that allows a heterogeneous network
of parallel and serial computers to use distributed
memory for concurrent computation. Because access to a
parallel computer for complicated computation is often
lacking, we show how to parallelize Beaton's sweep
operation under PVM for the analysis of regression and
designed experiments, and therefore for the analysis of
repeated measurement designs. The performance
of the parallelized sweep operator is evaluated.

1.  Introduction 

Parallel processing is increasingly important in
scientific fields such as statistical computing. In
computationally intensive areas of statistics such as regression
analysis and the analysis of experimental designs, existing
sequential algorithms should be parallelized so that
applications can be processed by parallel computers to speed
up the computation. However, parallel computers are
too costly for most colleges to obtain.

A high level programming environment, called PVM
(Parallel Virtual Machine), can be used on clusters of
heterogeneous networked Unix workstations and parallel
computers to do parallel processing without calling
low level Unix utilities from the communication layer, such
as sockets ([1],[4],[7]). PVM is an ongoing project, started
in the summer of 1989 at Oak Ridge National Lab.
Basically, PVM generates a series of tasks, like Unix
processes. They are synchronized by message passing,
which passes data between them so that they solve a
problem in parallel. The applications can be programmed in
either Fortran 77 or C.

In regression analysis and the analysis of designed
experiments, a common method used to solve the normal
equations in most statistical software is Beaton's
sweep operation ([2],[3]). In addition, this method also
gives insight into the least squares method. Some
important statistical measures are by-products of the process
of sweeping. However, the sweep operation is designed
to be sequential and it is difficult to utilize the power of
parallel computers while sweeping.

In this paper a method is proposed to parallelize
the sweep operation. The algorithm is implemented
in Fortran 77 under PVM 3.1 and the speed-up is
evaluated ([8]).

2. Sweep Operation

The most fundamental operation in regression analysis
and analysis of variance is Beaton's sweep operation.
It is one of the simplest methods to solve
the normal equations and is also a process of matrix
inversion by bordering ([3]). Beaton's sweep operation sweeps
through a positive semidefinite matrix, the matrix
$X^tX$ or $S$ matrix, and produces the sum of squares of
residuals and regression coefficients for a regression model,
or sums of squares for the residuals and all effects for a
model of designed experiments ([6]).

The algorithm of the sweep operation is described as
follows:

Sweep on the $i$th row of the matrix $S = (S_{ij})$, resulting in
a matrix $T = (T_{ij})$:

If $S_{ii} \ne 0$, then $T_{ii} = -1/S_{ii}$ and $T_{ij} = S_{ij}/S_{ii}$ for $j \ne i$.
Also $T_{jk} = S_{jk} - S_{ji} \times S_{ik}/S_{ii}$ for $k \ne i$ and $j \ne i$; else
$T_{ii} = 0$ and $T_{ij} = 0$, for $j \ne i$. Also $T_{jk} = S_{jk}$, for $k \ne i$
and $j \ne i$.
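A sequential version of one sweep can be sketched as follows. The sign convention used here ($T_{ii} = -1/S_{ii}$, $T_{ij} = S_{ij}/S_{ii}$), the tolerance `tol` for treating a pivot as zero and the function name are assumptions made for illustration; other conventions for the sweep operator differ in signs.

```python
def sweep(s, k, tol=1e-12):
    """Sweep the symmetric matrix s (list of lists) on pivot k (0-based).

    Returns a new matrix and leaves s unchanged.
    """
    m = len(s)
    d = s[k][k]
    t = [row[:] for row in s]
    if abs(d) <= tol:
        # Zero pivot: clear the pivot row and column, leave the rest as is.
        for j in range(m):
            t[k][j] = 0.0
            t[j][k] = 0.0
        return t
    for i in range(m):
        for j in range(m):
            if i != k and j != k:
                t[i][j] = s[i][j] - s[i][k] * s[k][j] / d
    for j in range(m):
        if j != k:
            t[k][j] = s[k][j] / d
            t[j][k] = s[j][k] / d
    t[k][k] = -1.0 / d
    return t


# Cross-product matrix of [1, x, y] for x = 1, 2, 3, 4 and y = 2 + 3 x.
s = [[4.0, 10.0, 38.0],
     [10.0, 30.0, 110.0],
     [38.0, 110.0, 406.0]]
t = sweep(sweep(s, 0), 1)
coefficients = (t[0][2], t[1][2])   # intercept and slope
residual_ss = t[2][2]               # residual sum of squares
```

Sweeping the two regressor pivots leaves the regression coefficients in the last column and the residual sum of squares in the bottom right corner; the same result is obtained if the two sweeps are done in the opposite order.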

The sweep operator is associative and commutative.
The operation is designed to be purely sequential,
that is, the matrix used for the current
sweep depends completely on the matrix resulting
from the previous sweep. In order to parallelize the sweep
operation so it can be used in regression analysis and
the analysis of experimental designs, the operation must be
studied in detail.

3.  Parallelize  Sweep  Operation 

Given an $n \times m$ regression matrix $X$, the positive
semidefinite symmetric matrix $S = (S_{ij})$ of dimension
$m$ can, in the $I$th sweep, be divided into four different areas
as follows:

Area A: $S_{ij}$, $i = I+1$, and $I+1 \le j \le m$

Area B: $S_{ij}$, $I+2 \le i \le m$, and $I+1 \le j \le m$

Area C: $S_{ij}$, $1 \le i \le I$, and $I+1 \le j \le m$

Area D: $S_{ij}$, $1 \le i \le I$, and $1 \le j \le I$

Since the $(I+1)$th sweep must use the value of $S_{I+1,I+1}$,
areas A and B can be processed first, and the elements in
areas C and D during the $I$th sweep do not depend on the
next sweep. Hence areas C and D of the current sweep
can be processed concurrently with areas A and B of the
next sweep. By overlapping areas C and D with areas
A and B of the problem domain, we can save part of
the total sweeping time. Areas A, B, C and D will be
processed concurrently by computers A, B, C and D in
a network. Since the matrix $S$ is symmetric, we need only
compute either the upper or the lower triangular elements
during sweeping in order to save execution time. Results will
be distributed from computer A to B, B to C, and C to D.
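The partition into areas can be made concrete with a small helper that assigns each upper triangular element to its area. This is only a sketch: the function name `area`, the 1-based indexing and the restriction to $i \le j$ are assumptions, and the inequalities are read as non-strict.

```python
def area(i, j, sweep_no):
    """Area label of element (i, j), 1-based with i <= j, in sweep number sweep_no."""
    if i > j:
        raise ValueError("only the upper triangle (i <= j) is classified")
    if j <= sweep_no:
        return "D"              # 1 <= i <= I and 1 <= j <= I
    if i <= sweep_no:
        return "C"              # 1 <= i <= I and I+1 <= j <= m
    if i == sweep_no + 1:
        return "A"              # i = I+1 and I+1 <= j <= m
    return "B"                  # I+2 <= i <= m


m, sweep_no = 5, 2
counts = {"A": 0, "B": 0, "C": 0, "D": 0}
for i in range(1, m + 1):
    for j in range(i, m + 1):
        counts[area(i, j, sweep_no)] += 1
print(counts)
```

Each element of the upper triangle falls in exactly one area, so the four modules never update the same element within a sweep.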

In  the  next  section  PVM  will  be  used  to  implement 
the  parallel  concept. 

4.  Implementation  Under  PVM 

PVM can be started on heterogeneous hosts
through a background daemon process called pvmd on
each host. Users can easily add or delete hosts
as they wish in the PVM environment. The daemons
communicate with each other through message passing.
Under the master-slave model it is easy to set up central
control of the user's input and output. That is, the master
program allows the user to enter the initial data and also
returns the collected results to the user. The master program
and each slave program run on different hosts, and
data sent as messages are passed between the master and
the slaves. Each host runs the same number of sweep
operations, except that host A sweeps only the elements
in area A, and similarly for the other hosts.

The process of writing a program under PVM is a
little complicated. Once the user has run a sequential
program successfully on one host, the corresponding
parallel program, with its different tasks (or
processes), is implemented on the same host. Finally, the
successful parallel program is run across different hosts.

4.1.  Algorithm 

Based on the master-slave model, the master program will
spawn four different processes, i.e. modules A, B, C and
D. Each module receives the initial matrix information and
sweeps the corresponding area of the matrix, using
message passing to maintain the order of sweeping.
The sweeping order starts with module A. Then
the results of module A are sent to module B. After
module B finishes, the results are sent to module C
to continue the current sweep. In the meantime the
results of module B are sent to module A again to start
the next sweep operation if necessary. Once module C
finishes, it sends the results to module D to finish
the current sweep operation. If necessary, the results of
module D of the current sweep and the results of
module B of the next sweep are sent together to module
C of the next sweep to continue. After the number of
sweeps requested by the user has been performed, the final
resulting matrix is sent back to the master program.

The  following  are  the  algorithms  for  each  program 
used: 

Algorithm  lor  master  program: 

1.  spawn  tasks  module  A,  B,  C  and  D. 

2.  enter  user’s  data  matrix  S 

3.  pack  all  data 

4.  broadcast  data  to  module  A,  B,  C  and  D. 

5.  wait  lor  receiving  results  Irom  module  D 

6.  unpack  the  results 

7.  output  the  results 

Algorithm  lor  module  A  program: 

1.  receive  the  matrix  S  Irom  master  program 

2.  unpack  data 

3.  loop  until  no  more  sweeps 

(i)  receive  the  last  sweep  Irom  module  B 

(ii)  unpack  results 

(iii)  sweep  the  matrix  in  area  A 

(iv)  pack  results 

(v)  send  the  results  to  module  B 
endloop 

Algorithm for module B program:

1. receive the matrix S from master program
2. unpack data
3. pack initial S data matrix
4. send initial S data matrix to start module A and trigger
   the start of the sweep operation
5. loop until no more sweeps
   (i) receive the results from module A
   (ii) unpack results
   (iii) sweep the matrix in area B
   (iv) pack results
   (v) send the results to module C
   (vi) send the results to module A
   endloop

Algorithm for module C program:

1. receive the matrix S from master program
2. unpack data
3. loop until no more sweeps
   (i) receive the results from module B
   (ii) unpack results
   (iii) receive the last sweep from module D
   (iv) unpack results
   (v) sweep the matrix in area C
   (vi) pack results
   (vii) send the results to module D
   endloop

Algorithm for module D program:

1. receive the matrix S from master program
2. unpack data
3. loop until no more sweeps
   (i) receive the results from module C
   (ii) unpack results
   (iii) sweep the matrix in area D
   (iv) pack the results
   (v) send the results to module C
   endloop
4. pack the final results
5. send the results back to master program


4.2.  Example 

Given m=4 and w=3, the matrix X is

$$X = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 3 & -1 \\ 1 & 3 & 0 & 2 \end{pmatrix}$$

Suppose $P_i$ denotes the initial matrix going to module P
before sweeping on area P in the i-th sweep. Then, following
the algorithm above, the resulting matrix is

$$\begin{pmatrix}
-2.805 & 0.956 & 0.587 & 4.02 \\
0.956 & -0.435 & -0.130 & -0.78 \\
0.587 & -0.130 & -0.239 & -0.935 \\
4.02 & -0.78 & -0.935 & 4.891
\end{pmatrix}$$
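
As a check on the example, the following minimal Python sketch applies a symmetric form of the sweep operator sequentially (not in parallel) to S = X'X. Two things here are assumptions rather than statements from the text: the sign convention used in `sweep` (the pivot is replaced by -1/d), and the fourth data row (1, 1, 2, 3), which does not appear in the matrix as printed above but is the row for which three sweeps reproduce the printed result to about three decimals.

```python
def sweep(s, k):
    """Return a copy of the symmetric matrix s swept on pivot k.

    Convention assumed here: pivot -> -1/d, pivot row and column -> s/d,
    all other entries -> s[i][j] - s[i][k] * s[k][j] / d.
    """
    m = len(s)
    d = s[k][k]
    out = [row[:] for row in s]
    for i in range(m):
        for j in range(m):
            if i == k and j == k:
                out[i][j] = -1.0 / d
            elif i == k or j == k:
                out[i][j] = s[i][j] / d
            else:
                out[i][j] = s[i][j] - s[i][k] * s[k][j] / d
    return out


def cross_product(x):
    """Form X'X for a data matrix given as a list of rows."""
    cols = len(x[0])
    return [[sum(r[i] * r[j] for r in x) for j in range(cols)]
            for i in range(cols)]


# The last row is an assumption (see the lead-in); the first three rows
# are the ones printed in the example.
X = [[1, 1, 1, 1],
     [1, 2, 3, -1],
     [1, 3, 0, 2],
     [1, 1, 2, 3]]

S = cross_product(X)
for pivot in range(3):          # w = 3 sweeps on pivots 1, 2, 3
    S = sweep(S, pivot)

for row in S:
    print(" ".join(f"{v:7.3f}" for v in row))
```

Under this convention the upper-left 3x3 block holds the negated inverse of the corresponding block of X'X, the last column holds the regression coefficients, and the lower-right entry is the residual sum of squares.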


5.  Performance  Evaluation 

Using the parallel algorithm of the last section, the
number of arithmetic operations of each kind is given in
the following table:

At the I-th sweep:

| Module | Number of +,-    | Number of *,/ |
| A      | m-I              | 2(m-I)        |
| B      | (m-I)(m-I-1)/2   | (m-I)(m-I-1)  |
| C      | (I-1)(m-I)       | (m-I)(2I-1)   |
| D      | (I-1)I/2         | I^2           |

On a Sparc Station, a floating-point * or / takes about
twice as long as a + or -. Let u be the time to perform one
+ or - operation. Then, in units of u, modules A, B, C and D
at the I-th sweep take $5(m-I)$, $5(m-I)(m-I-1)/2$,
$(m-I)(5I-3)$ and $(5I-1)I/2$, respectively.

Since module C of the current sweep starts at the same
time as module A of the next sweep, the time that the
parallel algorithm saves is either

$\sum(\text{time } C + D)$ on the I-th sweep, if time A+B on
the (I+1)-th sweep > time C+D on the I-th sweep, that is, for
$0 < I \le \lfloor a \rfloor$ or $\lceil b \rceil \le I < w$,

or $\sum(\text{time } A + B)$ on the (I+1)-th sweep otherwise,
that is, for $\lceil a \rceil \le I \le \lfloor b \rfloor$,
where w is the number of sweeps the user requests,

$$a = \frac{10m - \sqrt{50m^2 - 10m}}{10}$$

and

$$b = \frac{10m + \sqrt{50m^2 - 10m}}{10},$$

and m is the dimension of the data matrix S.

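Then the total time saved after w sweeps by the parallel
algorithm over the sequential algorithm, in terms of the
time unit u, is

$$\frac{5}{6}\left[\lfloor b\rfloor(\lfloor b\rfloor-1)(2\lfloor b\rfloor-1) - \lfloor a\rfloor(\lfloor a\rfloor+1)(2\lfloor a\rfloor+1)\right] - \frac{5}{12}w(w-1)(2w-1) + \left(5m+\frac{5}{2}\right)\frac{1}{2}w(w-1) + 5m\left[\lfloor a\rfloor(\lfloor a\rfloor+1) - \lfloor b\rfloor(\lfloor b\rfloor-1)\right] - 3m(w-1) + \left(\frac{5}{2}m^2+\frac{m}{2}\right)\left(\lfloor b\rfloor-\lfloor a\rfloor-1\right)$$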

The total time after w sweeps for the sequential algorithm,
obtained by summing the four module times over the w sweeps,
is $w\,m(5m-1)/2$.

Assuming no delay in sending, waiting for and receiving
messages, then for an m=200 matrix S after w=197 sweeps,
$\lfloor a \rfloor = 58$ and $\lfloor b \rfloor = w = 197$.
In terms of the time unit u, the total time saved is 3849535
out of a total time of 19680300. The speed-up is 1.24.
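
The figures above can be reproduced by direct summation. The sketch below is an illustration, not the authors' program: it assumes that the time saved at each overlap is the smaller of (time C+D on sweep I) and (time A+B on sweep I+1), for I = 1, ..., w-1, which is one reading of the argument above. The function names are hypothetical.

```python
import math


def module_times(m, i):
    """Times, in units of u, of modules A, B, C and D at the i-th sweep."""
    a = 5 * (m - i)
    b = 5 * (m - i) * (m - i - 1) // 2   # (m-i)(m-i-1) is always even
    c = (m - i) * (5 * i - 3)
    d = (5 * i - 1) * i // 2             # (5i-1)i is always even
    return a, b, c, d


def sequential_time(m, w):
    """Total time of w sequential sweeps."""
    return sum(sum(module_times(m, i)) for i in range(1, w + 1))


def time_saved(m, w):
    """Time saved by overlapping C+D of sweep i with A+B of sweep i+1."""
    saved = 0
    for i in range(1, w):
        _, _, c, d = module_times(m, i)
        a_next, b_next, _, _ = module_times(m, i + 1)
        saved += min(c + d, a_next + b_next)
    return saved


m, w = 200, 197
total = sequential_time(m, w)
saved = time_saved(m, w)
a = (10 * m - math.sqrt(50 * m * m - 10 * m)) / 10

print("floor(a)  =", math.floor(a))
print("total     =", total)
print("saved     =", saved)
print("speed-up  =", round(total / (total - saved), 2))
```

Each sweep costs the same m(5m-1)/2 units in this model, so the sequential total grows linearly in w, while the overlap that can be hidden changes from sweep to sweep.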

Both the sequential and the parallel algorithm were
implemented in Fortran 77 and run on a Sparc Station. Each
takes about 10 seconds for a randomly generated m=200 matrix
after w=197 sweeps. The parallel algorithm shows no advantage
because of the large amount of waiting time spent on
distributing data and on sending and receiving results
between tasks.

To reduce the number of messages passed, we can combine
module A and module B into one slave and module C and
module D into another. The number of messages passed after
w sweeps then changes from 5w-2 to w. After w=197 sweeps,
the actual parallel sweeping time goes down from 10 seconds
to 8.4 seconds, so a speed-up of 1.19 is reached in practice.

6.  Conclusion 

In this study we gained considerable experience with
parallel computation on a cluster of networked workstations,
even though we do not have access to a parallel computer. We
found that it is very easy to develop and implement a parallel
algorithm under PVM, although debugging is difficult. However,
the disadvantage of using distributed memory is that more
waiting time is spent on message passing. The speed-up of this
parallel algorithm can reach a maximum of only 1.33 because of
its fine grain size and unbalanced load. In addition, for the
parallel algorithm to be advantageous over a sequential
algorithm, many sweeps must be done in the computation, which
makes the round-off error significant. We will look for other
parallel algorithms for the computation of analyses of
regression and experimental designs. A more user-friendly
extension of PVM called the Heterogeneous Network Computing
Environment (HeNCE) can be used ([5]).

References 

[1] Geist A., Beguelin A., Dongarra J., Jiang W.,
Manchek R., Sunderam V. (1993). PVM3 User's Guide and
Reference Manual. Oak Ridge National Laboratory.

[2] Goodnight J.H. (1978). The Sweep Operator: Its
Importance in Statistical Computing. Interface Foundation.

[3] Heiberger R. (1989). Computation for the Analysis
of Designed Experiments. John Wiley and Sons.

[4] Dongarra J., Geist A., Manchek R., Sunderam V.
(1993). Integrated PVM Framework Supports Heterogeneous
Network Computing. Oak Ridge National Laboratory.

[5] Dongarra J., Geist A., Manchek R., Sunderam V.
(1994). The PVM Concurrent Computing System: Evolution,
Experience and Trends. Oak Ridge National Laboratory.

[6] Kennedy W.J., Gentle J.E. (1980). Statistical
Computing. M. Dekker Inc.

[7] Douglas C., Mattson T., Schultz M. (1993). Parallel
Programming Systems for Workstation Clusters. Computer
Research Report, Yale University.

[8] Quinn M. (1994). Parallel Computing: Theory and
Practice. 2nd edition. McGraw-Hill Inc.




Graphically  Analyzing  Computer  Log  Files 


Stephen  G.  Eick  and  Paul  Lucas 

AT&T  Bell  Laboratories 
Room IHC 1G-351
1000  East  Warrenville  Road 
Naperville,  IL  60566 
eick@research.att.com 


Keywords: software visualization, dynamic graphics, log files, Unix commands

SUMMARY 

Computers  generate  log  files  containing  reports  on  system  performance,  status,  and  faults.  To  analyze  these 
log files more efficiently, we have developed an interactive visualization system, SeeLog™, that displays
temporal  patterns  and  facilitates  exploratory  analysis  of  large  log  files.  We  apply  our  system  and  visualization 
techniques  to  analyze  command  accounting  log  files  from  a  Unix  compute  server,  although  our  motivating 
example  was  log  files  generated  by  software  development  lab  testing. 


1.  Introduction 

Many  computer  systems  generate  log  files  as  part 
of  their  normal  operation.  Such  files  typically 
contain  reports  on  system  performance,  status,  and 
software  faults.  The  reports  are  often  free-format 
and  time-stamped.  These  files  are  used  by 
engineers  for  detecting  and  correcting  system 
problems,  hopefully  before  they  become  service- 
affecting.  One  attribute  common  to  many  log  files 
is  that  they  often  contain  many  unimportant 
reports.  These  “noise”  reports  can  clutter  log 
files,  obscure  important  reports,  and  thereby  result 
in  real  problems  going  undetected. 

Although  our  motivating  example  comes  from 
analyzing  log  files  created  during  the  software 
development  process,  our  analysis  technique 
applies  to  other  log  files  equally  well.  To 
illustrate  our  technique,  we  use  the  Unix  System 
V  command  accounting  facility.  This  log  file 
contains  a  report  for  each  command  executed  and 
is  automatically  generated  as  part  of  standard 
operations.  It  contains  a  detailed  history  of  the 
machine’s  activity  and  it  is  used  by  system 
administrators  for  performance  tuning,  security 
monitoring,  and  could  be  used  for  usage  billing. 
We  find  it  particularly  interesting  because,  by 
studying  the  logs  from  one  of  our  own  machines, 
we  gain  insight  into  how  we  in  a  research 
department  use  computing  resources.  By 
analyzing  this  data,  we  have  gained  some 
interesting  insights  into  our  own  work  patterns. 


2.  Visualization  Technique 

Our log file analysis paradigm involves two steps:
parsing and visualization. Parsing a complicated
log file involves lexically scanning it to
note the times, types, and locations of all reports.
This step can be done using tools like grep,
AWK, Perl, or even a C program. The data
from the scanning are placed in a table that is the
input to SeeLog.
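
As an illustration of the parsing step, here is a minimal Python sketch that scans a log and builds a table of reports. The line layout (`HH:MM:SS user command`), the `Report` fields and the function name are assumptions made for this example; they are not the command accounting format and not the input format of SeeLog.

```python
import re
from collections import namedtuple

# One row of the table: time (seconds since midnight), type information
# (user and command) and location (line number in the log file).
Report = namedtuple("Report", "seconds user command line_no")

# Hypothetical report layout: "HH:MM:SS user command".
LINE = re.compile(r"^(\d{2}):(\d{2}):(\d{2})\s+(\S+)\s+(\S+)")


def parse_log(lines):
    """Scan log lines and return the reports in chronological order."""
    table = []
    for line_no, line in enumerate(lines, start=1):
        match = LINE.match(line)
        if match is None:
            continue  # not a report in the assumed layout; skip it
        hh, mm, ss, user, command = match.groups()
        seconds = int(hh) * 3600 + int(mm) * 60 + int(ss)
        table.append(Report(seconds, user, command, line_no))
    table.sort(key=lambda report: report.seconds)
    return table


sample = [
    "06:50:12 eick ksh",
    "free-format text that is not a report",
    "09:15:03 pjl CC",
    "07:47:40 eick mail",
]
table = parse_log(sample)
for report in table:
    print(report)
```

Keeping the line number in each row preserves the location of the report, so a display can link a tick mark back to the original text.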

To  create  a  visual  display  of  a  log  file,  the  reports 
are  arranged  chronologically  and  grouped  by  type. 
Each report is then represented as an angled “tick
mark” on a grid with time running along the x-axis
and report type along the y-axis. The report
types listed along the y-axis may be placed into
bands of related types. The result is a pattern of
horizontal bands, each containing a number of
related rows, with ticks indicating occurrences.
(See Figure 1.)

Within  each  band,  there  are  rows  for  the  distinct 
values  of  each  type.  The  type  name  is  printed  at 
the  left  side  of  the  display  and  the  type  value  is 
printed  next  to  its  corresponding  row.  The  rows 
may  be  sorted  in  decreasing  tick  mark  frequency 
or  in  alphabetical  order. 

In  most  datasets,  there  are  several  dimensions  of 
type  information.  For  example,  in  the  command 
accounting  dataset,  the  type  information  includes 
the  user-id,  number  of  characters  transferred,  and 
process size. There are three ways in which the tick
marks encode type information: rows (the primary
method), color, and angle. The color and angle of
each tick mark may encode different dimensions,
but often redundantly encode the same dimension.


3. An Example

The  display  of  the  log  file  in  Figure  2  shows  ten 
hours  of  data  from  4:00  to  13:59  from  our 
department’s  compute  server.  During  that  period, 
6179  command  accounting  reports  were 
generated.  The  display  shows  two  bands  of 
reports:  system  commands  (upper  band) — those 
executed  by  either  the  root,  adm,  or  various 
daemon  logins  and  user  commands  (lower 
band) — those  executed  by  ordinary  users.  The 
system  commands  are  sorted  alphabetically  by 
command  name;  the  user  commands  are  sorted 
descendingly  by  the  number  of  times  each 
command was executed. The tick marks are
displayed color- and angle-coded by the user-id.
The  user-ids  are  color-coded  according  to  the 
interactive  color  scale  on  the  left  side  of  the 
display.  The  total  number  of  occurrences  for  each 
command  is  shown  on  the  right  side  of  the  display 
in  the  form  of  a  bar  chart,  and  the  total  number  of 
occurrences  for  all  commands  is  shown  on  the 
bottom  of  the  display  in  the  form  of  a  stacked 
histogram.  The  slider  in  the  bottom-left  corner 
controls  the  bin  size  for  the  stacked  histogram  and 
is  currently  set  at  five  minutes. 
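
The counts behind such a stacked histogram can be sketched in a few lines of Python. This is an illustration only: the input layout (pairs of seconds since midnight and user-id) and the function name are assumptions, and nothing here is taken from the SeeLog implementation.

```python
from collections import Counter, defaultdict


def stacked_histogram(reports, bin_minutes=5):
    """Count reports per time bin, split by user-id.

    reports: iterable of (seconds_since_midnight, user_id) pairs.
    Returns {bin_start_in_seconds: Counter({user_id: count})}.
    """
    width = bin_minutes * 60
    bins = defaultdict(Counter)
    for seconds, user in reports:
        bin_start = (seconds // width) * width
        bins[bin_start][user] += 1
    return dict(bins)


reports = [
    (24612, "eick"), (24700, "eick"),                 # 06:50 to 06:55
    (33303, "pjl"), (33310, "pjl"), (33400, "eick"),  # 09:15 to 09:20
]
hist = stacked_histogram(reports, bin_minutes=5)
for start in sorted(hist):
    print(f"{start // 3600:02d}:{start % 3600 // 60:02d}", dict(hist[start]))
```

Changing `bin_minutes` plays the role of the slider: a larger value merges neighbouring bins, while the per-user counts inside each bin supply the stacked segments.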

Many  things  are  apparent  from  the  display  in 
Figure  2.  The  system  commands  chkconfig 
and rpc.mount, spawned by sh (Bourne shell),
execute  continuously  throughout  the  ten-hour 
period.  These  involve  the  network  file  system 
(NFS).  Another  sequence  of  commands  is 
executed  hourly,  on  the  hour,  and  involves 
accounting  and  periodic  administrative  tasks. 

The  user  commands  follow  a  different  pattern. 
The  stacked  histogram  at  the  bottom  of  the  display 
shows  that  there  was  little  user  activity  before  9, 
between  10  and  11  and  during  the  noon  hour.  On 
this  particular  day,  there  was  a  department 
seminar  between  10  and  11  and  a  lunch  for  our 
visitor.  The  most  popular  user  command  was  the 
CC  shell  script,  the  front-end  for  the  C++ 
compiler,  which  executed  1170  times. 

There  are  large  bunches  of  commands  executed  by 
user-id  pjl  (Paul  Lucas).  Those  “waves”  of 
commands  were  all  started  by  the  CC  command. 
(He  was  actually  compiling  SeeLog  a  few  times.) 
The  first  few  were  recompiles  of  selected  object 
files;  the  compile  performed  around  noon  was  a 
complete  recompile. 


Some  of  the  commands  have  tails  indicating  that 
they  ran  for  a  noticeable  length  of  time.  A  few 
commands  have  tiny  tails,  particularly  pjl’s 
makes  and  CCs,  that  are  at  different  heights.  The 
height  of  the  tails  varies  when  they  would 
otherwise  overlap  on  the  display.  A  make 
command  typically  executes  several  other 
commands  in  sequence.  On  a  single  processor 
machine  we  would  expect  the  commands  spawned 
by the make to be executed one after another
with  no  overlap.  The  make  command  on  our 
multi-processor  compute  server  can  make  object 
files  in  parallel.  The  overlapping  tails  are 
instances  when  parallelization  occurred,  since  the 
commands  are  executing  concurrently. 

The  first  set  of  user  commands  were  executed  by 
user-id  eick.  He  started  ksh  (Korn  shell)  and 
read mail at 6:50 (from home) and executed
several  other  commands  at  7:47  (at  work). 

4.  Summary 

The  SeeLog  system  embodies  a  graphical 
technique for visualizing large, computer-generated
log files. The system graphically
displays  log  file  reports  and  provides  interactive 
mechanisms  for  manipulating  the  display.  All  of 
the  reports  are  displayed  on  a  single  grid  as  tick 
marks,  using  position,  color,  and  angle  to  encode 
the  type,  time,  attributes  and  subattributes  of  each 
report.  Using  our  technique  we  have  analyzed  log 
files  with  over  80,000  error  messages,  in  a  fraction 
of  the  time  required  by  conventional  methods. 
This  log  file  analysis  technique  generalizes  to 
analyzing  any  stream  of  time-stamped,  typed 
reports. This includes output from transaction
systems,  data  networks  and  even  electronic-mail 
logs. 
Figure 1. Log File Display

Each tick mark represents one report and is positioned on a grid chronologically and grouped by type. The x-axis encodes
time and the y-axis type (classification by bands and types).


Figure  2.  Who  did  what:  Coded  by  user-id 

Each  tick  mark  represents  one  Unix  command  or  shell  script  that  executed  during  a  ten  hour  period  on  our  compute  server. 
The  tick  marks  are  positioned  on  a  chronological-by-type  grid  and  color-  and  angle-coded  to  show  the  user-id  executing  that 
command. 
Abstract

The Multi-String Rearranging Memory (MSRM) is a
computer  memory  system  designed  for  use  with  standard 
(e.g.,  IBM  486  or  larger)  computers.  Simultaneous  input, 
output,  and  data  rearrangement  operations  are  permitted 
when  it  is  installed  in  a  computing  system.  Such  common 
computations  as  formation  of  the  transpose  of  a  matrix, 
order  statistics  ranking  operations,  construction  of  empirical 
cumulative  distributions,  quantiles,  etc.,  require  no  more 
time  than  linear  time. 

Other  operations  for  which  the  MSRM  is  designed 
include  “skimming”  and  searching.  When  the  MSRM  is 
used  for  skimming,  the  largest  (smallest)  m  of  a  list  of  n 
entries  can  be  selected  in  the  amount  of  time  consumed  by  a 
single  scan  of  the  n  entries.  This  use  of  the  MSRM  permits 
such  operations  as  removal  of  outliers,  the  extraction  of  sets 
of  extremal  observations,  and  trimming  of  data.  When  the 
MSRM  is  used  for  searching,  m  entries  may  be  matched 
with  n  entries  in  the  MSRM;  the  amount  of  time  required  for 
this  operation  is  the  amount  of  time  needed  for  transmission 
of  the  m  entries  to  the  MSRM.  Delay  between  successive 
members  of  the  list  of  m  entries  is  not  required. 

Performance  characteristics,  statistical  uses,  and 
intrinsic  cost  of  the  MSRM  are  discussed  in  the  paper. 

0.  Introduction. 

The  Multi-String  Rearranging  Memory  (MSRM)  is  a 
computer  memory  system.  It  is  hardware,  not  software.  It 
consists  of  standard  RAM  and  some  control  circuitry 
suitable  for  VLSI  construction.  The  RAM  can  be  used  as 
ordinary  RAM  storage  when  it  is  not  being  used  for  the 
special purposes described below. The cost of the MSRM is
dominated  by  the  cost  of  its  RAM. 

The functional characteristics of the MSRM are
described in Ref. 1. The principal operations and their
performances will be reviewed in the first section.
Discussion of the speed advantages of the MSRM appears in
the second section. The third section describes some
advantages of its use in statistical computing. In particular, a
somewhat detailed example related to isotonic regression is
presented. It serves to indicate an advantage of the usage.

1.  Specialized  Operations. 

A  main  data  management  operation  is  that  of  sorting; 
i.e.,  the  placement  of  data  in  increasing  (decreasing)  order. 
Most  conventional  computers  use  n  log2(n)  serial  operations 
to  sort  data  where  n  is  the  number  of  records.  The  MSRM 
sorts  records  in  linear  time.  More  specifically,  the  n  records 
are  read  in  serially  without  any  delay  between  successive 
records.  When  they  are  written  out  serially,  again  without 
any  delay,  they  will  be  in  sorted  order.  One  can  think  of 
filling a pipe; during the input process the records undergo
some rearrangement. The final rearrangement takes place
during the output process.

The insert operation can also be performed in linear
time. On occasion we have need to withdraw some data,
modify it, and put it back in. This takes time proportional to
the number of records that are modified.

A  data  skimming  operation  is  valuable  and  fast. 
Suppose  we  desire  the  m  largest  (smallest)  records.  These 
can be placed in the MSRM in the time it takes for a single
pass  of  the  data.  Thus  this  operation  is  just  as  fast  as  the 
seemingly  similar  operation  of  screening  all  records  that 
satisfy  a  given  inequality  or  property. 
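
For comparison, a software analogue of skimming can be sketched in Python with a bounded heap. This is not the MSRM mechanism, whose details are not given here; it only shows that the m largest records can be collected in one pass over the data. In software the pass costs on the order of n log m comparisons, whereas the text states that the hardware needs only the time of a single scan. The function name is hypothetical.

```python
import heapq


def skim_largest(records, m):
    """Return the m largest records, in decreasing order, using one pass."""
    if m <= 0:
        return []
    heap = []  # min-heap holding the m largest records seen so far
    for x in records:
        if len(heap) < m:
            heapq.heappush(heap, x)
        elif x > heap[0]:
            # x beats the smallest retained record, so it takes its place
            heapq.heapreplace(heap, x)
    return sorted(heap, reverse=True)


print(skim_largest([5, 1, 9, 3, 7, 8, 2], 3))
```

Skimming the m smallest records works the same way on negated keys.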

Basic  searching  operations  can  also  be  performed  in 
linear  time.  If  an  ordered  sequence  of  size  n  is  placed  in  the 
MSRM  and  q  individual  query  records  are  submitted,  then 
these  records  can  be  input  serially,  without  delay,  and  the 
records  with  matching  keys  can  be  located  in  time 
proportional  to  q. 


¹The text of this paper was prepared by R. R. Read. The portion of the paper that pertains to the MSRM was obtained in
part  from  various  papers  prepared  by  myself.  We  suggest  that  any  inquiry  regarding  the  procedures  mechanized  with  the 
MSRM  be  directed  to  me;  it  is  unfortunate  that  proprietary  considerations  interfere  with  publication  of  some  of  the  details  of 
the  MSRM.  Philip  N.  Armstrong 
There is an additional advantage in that input and
output  operations  can  take  place  simultaneously;  as  one 
record  is  going  in,  another  one  can  be  withdrawn.  If  this  is 
done  in  sort  mode,  then  the  input  records  must  be  for  a  new 
sorted  file.  If  non-destructive  output  from  the  MSRM  is 
desired,  the  records  received  by  the  processor  may  be 
reinserted  into  the  MSRM  either  as  a  new  file  or  into  the 
original  file.  The  bi-directional  transmission  is  indicated  in 
Figure  1. 


General  Computing  System  with  MSRM 
Figure  1 

The  use  of  the  MSRM  relieves  the  computer’s  system 
control  of  many  of  its  tasks.  One  requires  but  a  single 
address  in  the  MSRM  and  writing  the  file,  all  records 
serially,  to  that  address.  On  output,  the  records  are  read 
serially  from  that  address.  Thus  the  MSRM  behaves  as  a 
pipe;  it  has  the  added  advantage  that  the  records  are 
rearranged  while  being  placed  in  and  withdrawn  from  “the 
pipe”. 

2.  Timing  Comparisons. 

The  importance  of  sorting  has  received  some  recent 
attention  [4,  5].  Typically  computer  centers  use  software 
systems  to  perform  sorting  operations  when  sorting  is 
required.  Normally  these  calls  are  not  visible  to  the  user. 
Time  spent  sorting  is  buried  in  the  elapsed  time  of  a  job  and 
the  number  of  calls  to  sorting  operations  is  also  lost. 

Some  idea  of  the  size  and  speed  of  an  MSRM  system 
may  be  gained  by  comparing  it  with  a  large  computer,  e.g., 
an  IBM  9012.  The  MSRM,  operating  at  currently  feasible 
frequency,  is  faster  than  the  IBM  system,  according  to  the 
published  IBM  specifications  [4].  An  MSRM  system  can  be 
constructed in accordance with the parameters:

Capacity: 1.2 gigabytes; MSRM word size: 4 bytes

Record Length is any fixed number of words

Memory Input/Output speed: 10^7 words per second

The amounts of time required for sorting files of various
sizes with the IBM system and with the MSRM are
tabulated in Table 1. In it, the column I/M is the ratio
defined by the amount of time consumed by the RAM
(IBM) system divided by the amount of time consumed by
the MSRM system.

The  file  and  record  sizes  used  in  Table  1  may  seem 
larger  than  those  contemplated  in  many  statistical 
computations.  Generally,  the  sorting  advantage  is  a  factor  of 
log2(n). The MSRM advantage increases with the number of
records  in  a  file  (data  set);  and  never  is  it  at  a  disadvantage. 

Table 1

MSRM Sort Timing

| File Size (Megabytes) | MSRM Elapsed² (M Time) | IBM Elapsed³ (I Time) | Ratio: I/M |
| 10                    | .25                    | 8                     | 32         |
| 20                    | .5                     | 13                    | 26         |
| 40                    | 1.0                    | 26                    | 26         |
| 80                    | 2.0                    | 50                    | 25         |
| 150                   | 3.75                   | 93                    | 24.8       |
| 300                   | 7.5                    | 182                   | 24.2       |
| 600                   | 15                     | 366                   | 24.4       |
| 1200                  | 30                     | 725                   | 24.2       |

²The quoted amounts of time do not include time for
access to the mass store, if any, in which the data is stored;
it is assumed that the file passes to the MSRM at the rate of
4x10^7 bytes per second. It is also assumed that, since data can
pass from the MSRM in sorted order, it is not necessary
to record the file in mass memory after it is received in the
MSRM. The computed time is thus, for the file of 300x10^6
bytes, 300x10^6/(4x10^7) = 7.5 seconds. This would also be the
output time if output is required before other uses for the
sorted file require such storage.

³The IBM data is published in Ref. 4. In trials to
determine the accuracy of the data assumed here, an Amdahl
computer was used (the Amdahl 5995-700A installed at the
Naval Postgraduate School at Monterey, CA) with the
collaboration of the Defense Manpower Data Center at
Monterey. The Amdahl system is somewhat slower than the
IBM system, but still the input/output time was reported to
be negligible compared to the sort time. This suggests that
the timing shown for the MSRM is at least nearly attainable
in the large IBM or Amdahl systems and neglect of the
input/output time is not a distortion of the MSRM
performance.

3. Statistical Computing.

The sorting and skimming capabilities of the MSRM
serve nicely for the elementary operations in data analysis.
Suppose the data are $(x_1, x_2, \ldots, x_n)$ and we require the
order statistics, the empirical distribution function,
histograms, and ranks. The order statistics

$$x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$$

can be obtained in linear time. The pairs $(x_{(i)}, i/n)$
effectively form the empirical cdf. The construction of
histograms follows directly from the order statistics for the
empirical cdf. Indeed the latter is useful for the construction
of histograms having equi-probable cells.

The generation of ranks seems to require an extra step:
form the pairs $(x_j, j)$ as records and rearrange the file
according to increasing $x_j$ in the MSRM. The output file has
records of the form $(x_{(i)}, j_{(i)}, i)$, and $(j_{(1)}, \ldots, j_{(n)})$ will be the
inverse permutation of the ranks. The indices i are appended
during the output process. Then the pairs $(j_{(i)}, i)$ are input to
the MSRM and sorted according to increasing values of the
$\{j_{(i)}\}$. The output will be the pairs $(j, r_j)$ where $\{r_j\}$ is the set
of ranks.
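
The two-sort construction of ranks can be written out as a short Python sketch, with ordinary software sorts standing in for the MSRM. The function name is hypothetical, and ties are broken by original position, which is an assumption the text does not address.

```python
def ranks_by_two_sorts(x):
    """Return the rank r_j of each x_j (1 = smallest) using two sorts."""
    # First sort: records (x_j, j) arranged by increasing x_j.
    by_value = sorted((value, j) for j, value in enumerate(x, start=1))
    # Append the output position i to each record as it leaves the sort,
    # keeping only the pair (j_(i), i).
    pairs = [(j, i) for i, (_, j) in enumerate(by_value, start=1)]
    # Second sort: by original index j, which yields the pairs (j, r_j).
    return [rank for _, rank in sorted(pairs)]


print(ranks_by_two_sorts([3.2, 1.5, 2.7]))
```

For the data (3.2, 1.5, 2.7) the first sort emits the indices 2, 3, 1, which is the inverse permutation of the ranks 3, 1, 2 returned by the second sort.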

The skim operation allows immediate trimming of the
data set without prior ordering. The top m = <pn> of the data
(where 0 < p < 1) can be removed with a single pass. That
top portion will reside in the MSRM and can be withdrawn
in sorted order. Such operations permit rapid access for the
study of extrema and outliers. The remainder of the data will
be in its original order.

Both tails can be trimmed virtually simultaneously: the
observations enter a single address of the MSRM which will
accept (and order) the first m. As the next observation goes
in it is compared with the smallest of the first m; the larger
of these two replaces the smaller, which in turn is sent to a
different MSRM file address (for skimming the lower <qn>
observations in a similar manner). This continues serially
and the file retained at the original address will contain the
<pn> largest while the file retained at the second address
will contain the <qn> smallest. These latter will be in
decreasing order. Of course p + q < 1. The remainder of the
data returns to ordinary storage (or perhaps is discarded).
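
The cascade described above can be imitated in software with two heaps. The Python sketch below is an assumption-laden illustration, not the MSRM circuitry: each observation is offered to the upper file, the loser of that comparison is offered to the lower file, and whatever loses there goes to ordinary storage. The function and variable names are hypothetical.

```python
import heapq


def trim_both_tails(data, n_top, n_bottom):
    """One pass over data; return (largest, smallest, remainder).

    largest and smallest are returned in decreasing order.
    """
    top = []       # min-heap of the n_top largest seen so far
    bottom = []    # negated values: heap of the n_bottom smallest so far
    remainder = []
    for x in data:
        if n_top > 0:
            if len(top) < n_top:
                heapq.heappush(top, x)
                continue
            # Keep the larger; the smaller of (x, min of top) moves on.
            x = heapq.heappushpop(top, x)
        if n_bottom > 0:
            if len(bottom) < n_bottom:
                heapq.heappush(bottom, -x)
                continue
            # Keep the smaller; the larger of (x, max of bottom) moves on.
            x = -heapq.heappushpop(bottom, -x)
        remainder.append(x)
    largest = sorted(top, reverse=True)
    smallest = sorted((-v for v in bottom), reverse=True)
    return largest, smallest, remainder


largest, smallest, remainder = trim_both_tails([5, 1, 9, 3, 7, 8, 2, 6, 4], 2, 3)
print(largest, smallest, remainder)
```

In this software version the remainder is emitted in the order in which values are displaced, which is generally not the original order of the data.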

The summation of the observations in a large data set
may be accomplished profitably using the MSRM in some
instances. If we insert $(x_1, \ldots, x_n)$ for withdrawal in
decreasing order of magnitude, then a sharp approximation
to the total may be computed without summing them all.
More specifically we can accumulate

$$S_m = x_{(1)} + x_{(2)} + \cdots + x_{(m)}$$

and the error in $S_m$ is dominated by

$$|n - m| \cdot |x_{(m+1)}|.$$
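
A small numerical illustration of this bound, as a Python sketch: a software sort by decreasing magnitude stands in for withdrawal from the MSRM, and the function name and sample values are hypothetical.

```python
def truncated_sum(x, m):
    """Sum the m largest-magnitude values and bound the neglected part.

    Returns (S_m, bound), where bound = (n - m) * |x_(m+1)|.
    """
    ordered = sorted(x, key=abs, reverse=True)
    n = len(ordered)
    partial = sum(ordered[:m])
    bound = (n - m) * abs(ordered[m]) if m < n else 0.0
    return partial, bound


x = [100.0, -50.0, 3.0, 2.0, -1.0, 0.5]
partial, bound = truncated_sum(x, 3)
print(partial, bound, sum(x) - partial)
```

Each of the n - m neglected terms is no larger in magnitude than the first neglected one, which is why the bound holds; it is sharp only when the remaining terms are of similar size and sign.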

We close with a more complex example of exploitation
of the MSRM in statistical computing. It serves to illustrate
the effective use of the MSRM in a statistical estimation
method. Consider isotonic regression in the simply ordered
case. We use the notation of [2, 3]. We hope to convince the
reader that the pool-adjacent-violators algorithm (PAVA)
can be modified to exploit the capabilities of the MSRM and
that the estimates can be produced in less time.


The setting is a set of k linearly ordered entities

$$x_1 < x_2 < x_3 < \cdots < x_k$$

and to each there is a value $g_i = g(x_i)$ and a weight $w_i > 0$.
The goal is to compute the isotonic regression values $\{g_i^*\}$
for $i = 1, \ldots, k$. These are the ordered values

$$g_1^* \le g_2^* \le \cdots \le g_k^* \qquad (1)$$

that most closely resemble the original $\{g_i\}$. The sense in
which this holds is described in the first chapters of [2, 3].
The development considers the cumulative sum diagram
(CSD) defined by the scatter plot of $[W_j, G_j]$ for
$j = 0, 1, \ldots, k$, where $W_0 = G_0 = 0$ and

$$W_j = \sum_{i=1}^{j} w_i \quad\text{and}\quad G_j = \sum_{i=1}^{j} w_i g_i.$$

See Figure 2 for an example. The development of the PAVA
involves the construction of the greatest convex minorant
(GCM) of the CSD; i.e., the supremum of all convex
functions whose graphs are below the CSD. It is known that
$g_j^*$ is the left derivative of the GCM at $W_j$ for $j = 1, \ldots, k$.

The first step is to check whether

$$g_1 \le g_2 \le \cdots \le g_k$$

holds for the original values. If so then $g_i^* = g_i$ for all
$i = 1, \ldots, k$ and there is nothing more to be done. The MSRM
can make this check in linear time. (If this condition is
satisfied then the CSD is a scatter plot which, when the
successive points are connected with straight line segments,
forms a convex function.)

If the first check fails, then the PAVA seeks a set of
levels or blocks of subscripts so that the $\{g_i^*\}$ are constant
within blocks but variable from block to block. Initially each
subscript is a block. The pooling of subscripts to form larger
blocks is accomplished through the selection of “violators”,
i.e., adjacent pairs having the property $g_i > g_{i+1}$. This pair
combines two blocks into one using the weighted average

$$(w_i g_i + w_{i+1} g_{i+1}) / (w_i + w_{i+1})$$

for its value and $(w_i + w_{i+1})$ for its weight. Then the
monotone inequality (1) is checked again. The algorithm is
finished if it is satisfied; otherwise choose another violator
pair and repeat the method.
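
For reference, the conventional sequential PAVA can be sketched in Python as follows. This is a common stack-based formulation of the pooling step described above, not the MSRM-based modification developed in this section; the function name is hypothetical.

```python
def pava(g, w):
    """Weighted isotonic (non-decreasing) regression by pooling violators.

    g: values g_1..g_k in the order of x_1 < ... < x_k; w: positive weights.
    Returns the fitted values g*_1..g*_k.
    """
    blocks = []  # each block is [weighted average, total weight, size]
    for value, weight in zip(g, w):
        blocks.append([value, weight, 1])
        # Pool while the last two blocks violate the ordering.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            v2, w2, n2 = blocks.pop()
            v1, w1, n1 = blocks.pop()
            total = w1 + w2
            blocks.append([(w1 * v1 + w2 * v2) / total, total, n1 + n2])
    fitted = []
    for value, _, size in blocks:
        fitted.extend([value] * size)
    return fitted


print(pava([1, 3, 2, 4], [1, 1, 1, 1]))
print(pava([3, 2, 1], [1, 1, 2]))
```

In the first call the violating pair (3, 2) is pooled to 2.5; in the second all three values end up in one block with weighted average 1.75.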

The  construction  of  blocks  is  done  sequentially.  That  is, 
a  block  size  grows  in  increments  of  one  during  each 
iteration.  It  is  possible  to  speed  up  this  process. 

In  the  interest  of  brevity  let  us  suppose  there  are  many 
violators.  Our  goal  is  to  illustrate  the  usefulness  of  the 
MSRM  without  unnecessary  details. 

Form records of the type

$$[W_j, G_j, j] \quad\text{for } j = 0, 1, \ldots, k$$

and send this file to an address in the MSRM so that the
records can be withdrawn in increasing order of the $\{G_j\}$. It
is convenient to include the index j in each record. When
withdrawing records from the MSRM we must make
comparisons and assign each to either the left pool, LP, or
the right pool, RP, according to whether j < TL or j > TR.
We also need a base value for computing slopes: GL for the
left pool and GR for the right pool. Initially GL = GR =
min{G_j} and TL = TR = j of the first record to come out of
the MSRM. The common horizontal accumulated weight is
WL = WR = W.

Figure 2: Scatter Plot that Supports the Cumulative Sum Diagram.
Steps to Construct the Greatest Convex Minorant.


The diagram can be used to visualize an iterative step in
the construction of the GCM. The records come out in
increasing order of $\{G_j\}$. The record is assigned to LP if
$j < TL$ and to RP if $j > TR$. As the records come out we
compute the appropriate slope

$SL = \dfrac{GL - G}{WL - W}, \qquad SR = \dfrac{G - GR}{W - WR}$

and send records $(SL, W, j)$ to a new file in the MSRM for
the left pool, and records $(SR, W, j)$ to another new file for
the right pool. The left pool is sorted in descending order,
(i.e., ascending order of magnitude), and the right pool is
sorted in ascending order. The iterative step stops when
either $j = 0$ or $j = k$.

For definiteness suppose we are stopped at $j = 0$. Then
we address the file that contains the SLs (i.e., the left slopes)
and extract the first one (the largest slope). We set $g_i^* = SL$
for all $j < i \le TL$; we update $TL = j$, $WL = W$ and purge the
left slope file. Next check the updated (1) for violators at TL
or to its left. The left pool can be abandoned if there are
none. We also return the $(W, G, j)$ records, for $j \le TL$, to the
original MSRM file address. The others are discarded. Then
start another iteration. It may begin a new left pool or it may
complete the existing right pool or both. Note that it is never
necessary to use slopes of segments connected to CSD
points that are above zero for the left pool, or above $G_k$ for
the right pool. It is clear that the process will finish and
produce the desired isotonic regression.

The  algorithm,  with  obvious  modification,  can  be  used 
to  find  the  convex  hull  of  a  two  dimensional  point  cloud. 
Also  there  is  an  obvious  simplification  to  this  version  of  the 
PAVA.  One  can  add  a  known  constant  to  all  of  the  (gi*)  so 
that  there  are  no  negative  values.  This  will  circumvent  the 
need  for  a  left  pool. 

The  speed  of  the  procedure  rests  on  the  fact  that  there  is 
but  a  minimal  amount  of  addressing.  Each  address  merely 
opens  a  “pipe”  from  which  all  needed  information  appears 
and  in  the  proper  order.  The  number  of  addresses  is  not 
determined by the magnitude of the data set; it appears to be
small, perhaps 3 or 4. Also, the original PAVA appears to
have more intermediate steps, more overt comparisons, and
more weighted averages to compute. It would be interesting
to have timed comparisons for a variety of cases.

The  advantages  of  using  an  MSRM  lie  in  simplified 
programming,  fewer  address  and  fetch  operations,  and 
greater  speed. 
Abstract

Histogram-type  density  estimators  have  some 
notable  computational  advantages  over  other  forms  of 
density  estimation  by  virtue  of  the  WARPing 
algorithm. However, traditional fixed-bin-width histograms have
less than satisfactory smoothing properties, being too
coarse in regions of high density and too fine in
regions of low density. Scott (1992) suggests the ASH
algorithm as a means of overcoming these problems,
but the ASH algorithm is computationally intensive,
somewhat negating the benefits of WARPing.
Wegman  (1975)  proposed  a  variable  bin-width 
technique  for  one  dimensional  density  estimators  and 
used  sieve-type  methods  to  show  strong  consistency 
results  that  did  not  depend  on  smoothness  properties 
of  the  underlying  density.  In  this  paper,  we  extend 
this  idea  to  high-dimensional,  variable  bin-width 
meshes.  The  boundaries  of  the  bins  are  determined 
by  a  random  subsampling  of  the  observations.  An 
extension  of  the  WARPing  algorithm  may  still  be 
used  for  fast  computation.  We  give  combinatorial 
arguments  for  calculating  the  number  of  bins  and  also 
the  conditional  expectation  and  variance  of  the 
number  of  observations  per  bin.  Conditional  on  the 
random  hyper-rectangular  tessellation,  we  calculate 
the  maximum  likelihood  density  estimator. 

Introduction 

In  this  paper,  a  density  estimation  method  is 
developed  that  is  computationally  more  tractable  than 
kernel  density  methods,  and  has  better  smoothing 
properties  than  traditional  fixed  binning  methods. 


The basic method is easy to describe in one
dimension. Randomly select a subset of $m$
observations $\{Y^*\}$ from a set of $n$ observations $\{Y\}$,
$m < n$, together with $\max\{Y\}$ and $\min\{Y\}$. Order
the set $\{Y^*\}$ to form the ordered set $\{Y^*_{(i)}\}$. A set of
random-width bins $\{B\}$ can be constructed using
adjacent elements in the ordered set. Then attribute
the probability mass of all observations in $\{Y\}$ to the
bins in $\{B\}$. The probability density on an element
$B_i \in \{B\}$ is the relative probability mass on $B_i$
divided by the length of $B_i$, cf. Wegman (1975) and
Hearne and Wegman (1991). There are many ways to
generalize these results to a d-dimensional support
space. The generalization that we have adopted here
is to define random-width d-dimensional rectangular
bins generated by a random sample from the set of
observations.
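
The one-dimensional construction can be illustrated with a short sketch. The function name `random_bin_density`, the `seed` argument, and the convention that bins are closed on the left (with the last bin also closed on the right) are assumptions made for this example; the text does not specify how ties at bin boundaries are handled.

```python
import bisect
import random


def random_bin_density(obs, m, seed=0):
    """Random-bin-width histogram density in one dimension.

    obs: list of observations with at least two distinct values,
    m: subsample size (m < len(obs)). Returns (edges, densities).
    """
    rng = random.Random(seed)
    subsample = rng.sample(obs, m)
    # Bin boundaries: the subsample together with min and max of all data.
    edges = sorted(set(subsample) | {min(obs), max(obs)})
    counts = [0] * (len(edges) - 1)
    for y in obs:
        # Bin k covers [edges[k], edges[k+1]); the maximum goes to the last bin.
        k = min(bisect.bisect_right(edges, y) - 1, len(counts) - 1)
        counts[k] += 1
    n = len(obs)
    # Density = relative mass on the bin divided by the bin length.
    densities = [c / (n * (edges[k + 1] - edges[k]))
                 for k, c in enumerate(counts)]
    return edges, densities
```

By construction the estimate integrates to one over the range of the data, since the bin masses are relative frequencies.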

Random-width  d-Dimensional  Bin  Tessellation 

Given a set of $n$ observations, $\{Y\}$, in a $d$-dimensional
Euclidean space, let $A^d$ be the minimum $d$-dimensional
rectangular cover of $\{Y\}$. Each observation $Y_j \in \{Y\}$
can be written in the form $Y_j = (y_j^1, \ldots, y_j^d)$. Then $A^d$
can be defined by the set of maximum and minimum values
for the $d$ coordinate axes,

$A^d = \{x \in R^d : x^i \ge \min(Y^i) \wedge x^i \le \max(Y^i),\ 1 \le i \le d\}$.

A $d$-dimensional rectangular tessellation of $A^d$
can be generated by selecting a random subsample of
$N$ observations $\{S_N\}$ from $\{Y\}$. For each of the $d$
coordinate axes let $\{S_N^i\}$ be the set of the $i$-th
coordinate for all $y \in \{S_N\}$ together with $\max(Y^i)$
and $\min(Y^i)$. Let $\{S^i\}$ be the ordered set of unique
elements in $\{S_N^i\}$, and $s^i = \mathrm{card}\{S^i\}$. A set of one
dimensional bins, $\{b^i\}$, can be generated for each of
the $d$ coordinate axes by adjacent elements in the set
$\{S^i\}$, and $\mathrm{card}\{b^i\} = s^i - 1$. The $d$-dimensional
rectangular random tessellation $\{B_N^d\}$ of $A^d$ can then
be generated by the cross product of the sets of one
dimensional bins for each coordinate axis;

$\{B_N^d\} = b^1 \times b^2 \times \cdots \times b^d$, and
$m = \mathrm{card}\{B_N^d\} = \prod_{i=1}^{d} (s^i - 1)$.

The upper bound on the cardinality of the set of
one dimensional bins that are generated for each of
the coordinate axes is $s^i - 1 \le N + 1$, $1 \le i \le d$, since
the random sample $\{S_N\}$ may include observations that
contain $\max(Y^i)$ or $\min(Y^i)$, observations are
recorded only to finite precision, and computers
operate on a subset of the rational numbers. The
cardinality of the tessellation $\{B_N^d\}$ thus has the
upper bound, given the random subsample $\{S_N\}$,

$m = \mathrm{card}\{B_N^d\} = \prod_{i=1}^{d} (s^i - 1) \le (N + 1)^d$.
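
As a small illustration of the construction and of the bound on the number of bins, the per-axis boundary sets and the bin count can be computed as follows. The helper names `tessellation_edges` and `bin_count` and the `seed` argument are hypothetical, and the sketch assumes each coordinate takes at least two distinct values.

```python
import random


def tessellation_edges(obs, N, seed=0):
    """Ordered unique boundaries per axis from a random subsample of N points.

    obs: list of d-dimensional points (tuples). Returns one sorted list of
    boundaries per coordinate axis.
    """
    rng = random.Random(seed)
    subsample = rng.sample(obs, N)
    d = len(obs[0])
    edges = []
    for i in range(d):
        # i-th coordinates of the subsample plus the axis minimum and maximum.
        axis = {p[i] for p in subsample}
        axis |= {min(y[i] for y in obs), max(y[i] for y in obs)}
        edges.append(sorted(axis))
    return edges


def bin_count(edges):
    """Number of rectangular bins: product over axes of (boundaries - 1)."""
    m = 1
    for axis_edges in edges:
        m *= len(axis_edges) - 1
    return m
```

With `N` subsampled points each axis has at most `N + 2` boundaries, so `bin_count` never exceeds `(N + 1) ** d`.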

In Figure 1 a set of observations $\{Y\}$ in $R^2$ have
values $\max(Y^1)$, $\min(Y^1)$, $\max(Y^2)$, and $\min(Y^2)$.
These values define the minimum 2-dimensional
rectangular cover $A^2$ of $\{Y\}$. A random subsample of
three observations is drawn from $\{Y\}$, $\{S_3\} = \{P_1, P_2, P_3\}$.
These three points together with the maximum and
minimum values for each of the coordinate axes
generate the set of bins $\{B_3^2\}$ of $A^2$.


Figure  1 

The tessellation $\{B_N^d\}$ of $A^d$ is adaptive in the
sense that the elements of the tessellation tend to be
large where the observations are sparse and small
where the observations are not sparse.

Conditional  Expectation  and  Variance  of  the  Number 
of  Observations  per  Bin 

Let $B_k$, $1 \le k \le m$, be the $k$-th $d$-dimensional bin
in the tessellation $\{B_N^d\}$ of $A^d$, and let $Z_k$ be the
number of observations in $\{Y\}$ that are in $B_k$. The
expected value of $Z_k$ given the tessellation $\{B_N^d\}$ is
the number of observations that might be attributed
to the bin times the probability that the $d$-dimensional
random variable $X$ is in the $k$-th bin;

$E[Z_k \mid \{B_N^d\}] = (n - N)\,P(X \in B_k)$.

Let $U_j^i$, $1 \le i \le d$, be the empirical probability
mass on the $j$-th one dimensional bin,
$1 \le j \le s^i - 1$, for the $i$-th coordinate axis.
Using order statistical arguments, cf. Rohatgi
(1976) pp. 575-580, it can be shown that;

^U)  I  {5'}]  =  1  <  i  <  s*  -  1,  and 

Since the tessellation $\{B_N^d\}$ of $A^d$ is generated by the
cross product of the one dimensional bins on each of
the $d$ coordinate axes then the probability mass that
is on a given $d$-dimensional bin $B_k \in \{B_N^d\}$, given the
tessellation $\{B_N^d\}$,

I  =  n  1  <  *  <  m,  and 

Multiplying by the number of observations that might
be attributed to a $d$-dimensional rectangular bin,
$n - N$, and applying the inequality bounding the
cardinality of the number of bins in the tessellation;

|/Rd\K(n-iV)2(Ar-l)‘* 

A  Class  of  Probability  Density  Estimators 

Let $n$ be the number of observations in the set of
observations $\{Y\}$, and let $n_k$ be the number of
observations in the $k$-th rectangular bin in the
tessellation $\{B_N^d\}$. Let $W(N_k)$ be the probabilistic
mass of observations in the tessellation generating set
$\{S_N\}$ that are attributed to an adjacent bin in the
tessellation, $B_k \in \{B_N^d\}$, by the function $W(\cdot)$. And
let $C_k$ be the $d$-dimensional content of the $k$-th
element of the tessellation. Then we can define a
class of probability density estimators on a
tessellation by;

$\hat{f}(x \in B_k) = \dfrac{n_k + W(N_k)}{n\,C_k}$.


This  class  of  probability  density  estimators  is 
constant  on  each  bin  in  the  tessellation,  and  the 
content  of  each  of  the  d-dimensional  bins  in  the 
tessellation is easily computed. The probabilistic
mass attribution function $W(\cdot)$ is closely related to
the likelihood function.

The  Likelihood  Function 

The  likelihood  function  was  introduced  as  a 
means  for  optimizing  the  parameter  values  in  the 
parametric  density  estimation  setting  so  that  the 
fitted  parametric  function  would  best  fit  a  set  of 
observations.  In  the  nonparametric  setting  the 
likelihood  function  has  utility  if  there  is  a  variable  in 
the  class  of  density  estimators.  The  weight  that  is 
attributed to bins in the tessellation by observations
in $\{S_N\}$ is variable and can be used to optimize the
likelihood function.

The likelihood function for this class of
probability density estimators is

$L = \prod_{j=1}^{n} \hat{f}(Y_j)$,

the product of the density estimates for each of the
observations. But the class of density estimators that
is presented here consists of estimators on the set of bins in
the tessellation of $A^d$, so the likelihood function can be
reformulated in terms of the elements of the
tessellation;

it = 1 V  ”  ■  ) 
Taking the first derivative of the log of the
likelihood function with respect to $W(N_k)$;

d  ,  r  _  ^i^k)  \ 

+  t^{lo9{H  +  W{^k))  -H^-Ck)} 

If the first derivative is set equal to zero and solved
for $W(N_k)$ then the estimator will be optimized,
either maximized or minimized depending on the sign
of the second derivative of the log of the likelihood
function. Taking the second derivative of the log of
the likelihood function;


dr  ^ 

dW{Nkf  ~kVink  +  W{N,^J 


The second derivative of the log of the likelihood
function with respect to $W(N_k)$ is positive on all bins
in the tessellation that have observations in them,
$n_k > 0$, and is undefined where $n_k = 0$. The likelihood
function is thus convex and the likelihood function is
maximized when the probabilistic mass of all
observations in $\{S_N\}$ is attributed to the adjacent
bin  where  will  be  largest. 


A  Random  Bin-width  Warping  Algorithm 

For the proposed probability density estimation
method to be of utility it is important that density
estimates be readily computable, given a set of $n$
observations, $\{Y\}$, in a $d$-dimensional Euclidean space.
The principal computational complexity is in the
attribution of observations to bins in the tessellation,
$\{B_N^d\}$, of the minimum $d$-dimensional rectangular
cover $A^d$ of $\{Y\}$. In conventional fixed width binning
methods an algorithm called warping has been
developed that increases the speed and reduces the
computational complexity for attributing observations
to bins in the tessellation. This algorithm has been
extended to variable bin-width tessellations.

Given $N$, the number of observations in the
random sample of observations used to generate the
rectangular bins in the tessellation, the cardinality of
the set of bins, $m$, is bounded by;

$m = \mathrm{card}\{B_N^d\} = \prod_{i=1}^{d} (s^i - 1) \le (N + 1)^d$.

For each coordinate axis there is an upper bound on
the number of one dimensional bins that can be
generated. Let Bound_Values[i, j] be a matrix with
the $i$-th row, $0 \le i < d$, corresponding to $\{S^i\}$ and
Bound_Values[i, 0] $= \min(Y^i)$. Then for each row $i$,
$0 \le j \le s^i - 1$. Let Bin_Index[i, k] be a matrix with
the $i$-th row a vector of integer indices into the matrix
Bound_Values[i, j], with $0 \le k \le w^i$, where $w^i$ is the
selected number of warping indices for the $i$-th
coordinate axis, $s^i - 1 \le w^i$.

Let $b^i = \min(Y^i)$ and $a^i = \dfrac{\max(Y^i) - \min(Y^i)}{w^i}$ for
the $i$-th coordinate axis, $0 \le i < d$. For any point
$x^i \in [\min(Y^i), \max(Y^i)]$, the value

Index $= \mathrm{Truncate}\left(\dfrac{x^i - b^i}{a^i}\right)$

is an integer in the range $0 \le$ Index $\le w^i$. For the $i$-th
coordinate axis, let the $k$-th entry in the matrix
Bin_Index[i, k] be the smallest index $j$ into the
matrix Bound_Values[i, j] such that

$a^i \cdot k + b^i \le$ Bound_Values[i, j].

Then an efficient algorithm to compute the bin index
for the $i$-th coordinate axis, $0 \le i < d$, is shown in the
following code fragment.
Get_Bin_Index(i, x^i)
    Table_Index = Truncate((x^i - b^i) / a^i)
    Index = Bin_Index[i, Table_Index]
    While (x^i > Bound_Values[i, Index]) Index++
    Return Index
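
A runnable sketch of this lookup for a single coordinate axis is given below. It is an illustration under stated assumptions rather than the authors' implementation: the names `build_bin_index` and `get_bin_index` are hypothetical, the table holds `w + 1` entries so that the axis maximum has a slot, and a backward step is added as a guard against floating-point rounding in the truncated quotient.

```python
def build_bin_index(bounds, w):
    """Precompute the warping table for one axis.

    bounds: sorted unique bin boundaries (bounds[0] = axis minimum),
    w: number of equal-width warping cells. Returns (a, b, table) where
    table[k] is the smallest j with a*k + b <= bounds[j].
    """
    b = bounds[0]
    a = (bounds[-1] - bounds[0]) / w
    table = []
    j = 0
    for k in range(w + 1):
        while j < len(bounds) - 1 and bounds[j] < a * k + b:
            j += 1
        table.append(j)
    return a, b, table


def get_bin_index(x, bounds, a, b, table):
    """Smallest j with x <= bounds[j], found from the warping table."""
    t = int((x - b) / a)                    # Truncate
    t = min(max(t, 0), len(table) - 1)      # keep the table index in range
    idx = table[t]
    while x > bounds[idx]:                  # walk right, as in the fragment
        idx += 1
    while idx > 0 and x <= bounds[idx - 1]:  # rounding guard (assumption)
        idx -= 1
    return idx
```

The returned index `j` identifies the bin whose upper boundary is `bounds[j]`; when `w` is large relative to the number of bins, the first table entry is usually already correct and the loops do not execute.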

The number of warping indices, $w^i$, is
specified by the user of the density estimation
method. The question of how large $w^i$ should be is of
interest. We want to maximize the probability of
selecting the correct bin index on the first attempt for
each of the $d$ coordinate axes. The bound on the
probability of selecting the correct bin index on
the first attempt is;

$P(x^i \le$ Bound_Values[i, Bin_Index[i, Table_Index]]$)$

1 

The larger $w^i$ is relative to $s^i - 1$, the larger the
probability that the correct bin index will be
computed  on  the  first  attempt.  If  the  density 
function  is  symmetric  then  the  expected  value  of  the 

probability  is  ^ 

w* 

Conclusions  and  Extensions 

Random-width binning methods are a
computationally tractable alternative to fixed-width
binning methods. The sizes of the bins in a d-
dimensional space are adaptive so that the bins will
tend to be large where the observations are sparse and
small where the observations are not sparse. Bounds
on the expected value and variance of the number of
observations that are attributed to each bin can be
calculated, given the size of the subsample that is
randomly selected from the set of observations to
generate the d-dimensional bins. The likelihood
function is a convex function that can be maximized
or minimized to give a maximum entropy estimate by
selecting the appropriate probabilistic weight
distribution function $W(\cdot)$, cf. Hearne and Wegman
(1992). By applying an extension to the WARPing
algorithm, the random-width binning method is only
slightly more computationally intensive than
fixed-width binning methods.

One of the natural extensions to random-width
binning methods is to apply a resampling scheme, cf.
LePage and Billard (1992). Given smoothness
assumptions about the underlying probability density,
the size of the set of observations, the dimension
of the observations space, and the expected value and
variance bound on the number of observations that
are attributed to each bin might be used to find the
optimal subsample size, and the number of resampling
repetitions necessary to achieve the desired density
estimate smoothness. Resampling in an optimal way
is believed to be less computationally intensive than
either kernel or ASH methods, cf. Scott (1992).

Bibliography 

Hearne, L.B. and Wegman, E.J. (1991). “Adaptive
Probability Density Estimation in Lower
Dimensions using Random Tessellations”,
Computing Science and Statistics,
Keramidas, E.M. (ed.), 23 241-245,
Interface Foundation of North America,
Fairfax Station, VA.

Hearne, L.B. and Wegman, E.J. (1992). “Maximum
Entropy Density Estimation using Random
Tessellations”, Computing Science and Statistics,
Newton, J. (ed.), 24 483-487, Interface Foundation
of North America, Fairfax Station, VA.


L,B.  Heame  and  EJ.  Wegman  155 


LePage, R. and Billard, L. (1992). Exploring the
Limits of Bootstrap. John Wiley & Sons,
New York.

Rohatgi, V.K. (1976). An Introduction to Probability
Theory and Mathematical Statistics.
John Wiley & Sons, New York.

Wegman, E.J. (1975). “Maximum Likelihood
Estimation of a Probability Density Function”,
Sankhyā Ser. A 37 211-224.




Global  Tree  Optimization: 

A  Non-greedy  Decision  Tree  Algorithm 

Kristin  P.  Bennett 

Department  of  Mathematical  Sciences 
Rensselaer  Polytechnic  Institute 
Troy,  NY  12180 


Abstract 

A non-greedy approach for constructing globally optimal
multivariate decision trees with fixed structure is proposed.
Previous greedy tree construction algorithms are
locally optimal in that they optimize some splitting criterion
at each decision node, typically one node at a time.
In contrast, global tree optimization explicitly considers
all decisions in the tree concurrently. An iterative linear
programming algorithm is used to minimize the classification
error of the entire tree. Global tree optimization
can be used both to construct decision trees initially and
to update existing decision trees. Encouraging computational
experience is reported.

1  Introduction 

Global Tree Optimization (GTO) is a new approach for
constructing decision trees that classify two or more sets
of n-dimensional points. The essential difference between
this work and prior decision tree algorithms (e.g. CART
[5] and ID3 [10]) is that GTO is non-greedy. For greedy
algorithms, the “best” decision at each node is found by
optimizing some splitting criterion. This process is started
at the root and repeated recursively until all or almost
all of the points are correctly classified. When the sets
to be classified are disjoint, almost any greedy decision
tree algorithm can construct a tree consistent with all
the points, given a sufficient number of decision nodes.
However, these trees may not generalize well (i.e., correctly
classify future not-previously-seen points) due to
over-fitting or over-parameterizing the problem. In practice
decision nodes are pruned from the tree. Typically,
the pruning process does not allow the remaining decision
nodes to be adjusted, thus the tree may still be
over-parameterized. The strength of the greedy algorithm is
that by growing the tree and pruning it, the greedy
algorithm determines the structure of the tree, the class
at each of the leaves, and the decision at each non-leaf
node. The limitations of greedy approaches are that locally
“good” decisions may result in a bad overall tree
and existing trees are difficult to update and modify.

GTO overcomes these limitations by treating the decision
tree as a function and optimizing the classification
error of the entire tree. The function is similar to the one
proposed for MARS [8], however MARS is still a greedy
algorithm. Greedy algorithms optimize one node at a
time and then fix the resulting decisions. GTO starts
from an existing tree. The structure of the starting tree
(i.e. the number of decisions, the depth of the tree, and
the classification of the leaves) determines the classification
error function. GTO minimizes the classification error
by changing all the decisions concurrently while keeping
the underlying structure of the tree fixed. The advantages
of this approach over greedy methods are that fixing
the structure helps prevent overfitting or overparameterizing
the problem, locally bad but globally good decisions
can be made, existing trees can be re-optimized with additional
data, and domain knowledge can be more readily
applied. Since GTO requires the structure of the tree
as input, it complements (not replaces) existing greedy
decision tree methods. By complementing greedy algorithms,
GTO offers the promise of making decision trees
a more powerful, flexible, accurate, and widely accepted
paradigm.

Minimizing the global error of a decision tree with fixed
structure is a non-convex optimization problem. The
problem of constructing a decision tree with a fixed number
of decisions to correctly classify two or more sets is
a special case of the NP-complete polyhedral separability
problem [9]. Consider this seemingly simple but NP-complete
problem [9]: Can a tree with just two decision
nodes correctly classify two disjoint point sets? In [4],
this problem was formulated as a bilinear program. We
now extend this work to general decision trees, resulting
in a multilinear program that can be solved using the
Frank-Wolfe algorithm proposed for the bilinear case.

This paper is organized as follows. We begin with a
brief review of the well-known case of optimizing a tree
consisting of a single decision. The tree is represented as
a system of linear inequalities and the system is solved
using linear programming. In Section 3 we show how
more general decision trees can be expressed as a system
of disjunctive linear inequalities and formulated as
a multilinear programming problem. Section 4 explains
the iterative linear programming algorithm for optimizing
the resulting problem. Computational results and conclusions
are given in Section 5.

Figure 1: A typical two-class decision tree

GTO applies to binary trees with a multivariate decision
at each node of the following form: If $x$ is a point
being classified, then at decision node $d$, if $xw^d > \gamma^d$ the
point follows the right branch, if $xw^d < \gamma^d$ then the point
follows the left branch. The choice of which branch the
point follows at equality is arbitrary. This type of decision
has been used in greedy algorithms [6, 1]. The univariate
decisions found by CART [5] for continuous variables can
be considered special cases of this type of decision with
only one nonzero component of $w$. A point is classified
by following the path of the point through the tree until
it reaches a leaf node. A point is strictly classified by
the tree if it reaches a leaf of the correct class and equality
does not hold at any decision along the path to the
leaf (i.e. $xw^d \ne \gamma^d$ for any decision $d$ in the path).
Although GTO is applicable to problems with many classes,
for simplicity we limit discussion to the problem of classifying
the two sets A and B. A sample of such a tree is
given in Figure 1. Let A consist of $m$ points contained in
$R^n$ and B consist of $k$ points contained in $R^n$. Let $A_j$
denote the $j$th point in A.

2  Optimizing  a  Single  Decision 

Many methods exist for minimizing the error of a tree
consisting of a single decision node. We briefly review
one approach which formulates the problem as a set of
linear inequalities and then uses linear programming to
minimize the errors in the inequalities [3]. The reader is
referred to [3] for full details of the practical and theoretical
benefits of this approach.

Figure 2: Geometric depiction of decision trees.
(a) Tree found by Greedy LP Algorithm. (b) Tree found by GTO.

Let $xw = \gamma$ be the plane formed by the decision. For
any point $x$, if $xw < \gamma$ then the point is classified in class
A, and if $xw > \gamma$ then the point is classified in class B. If
$xw = \gamma$ the class can be chosen arbitrarily. All the points
in A and B are strictly classified if there exist $w$ and $\gamma$
such that

$A_j w - \gamma < 0, \quad j = 1 \ldots m$
$B_i w - \gamma > 0, \quad i = 1 \ldots k$   (1)

or equivalently

$-A_j w + \gamma \ge 1, \quad j = 1 \ldots m$
$B_i w - \gamma \ge 1, \quad i = 1 \ldots k$   (2)

Note that Equations (1) and (2) are alternative definitions
of linear separability. The choice of the constant 1 is
arbitrary. Any positive constant may be used.

If A and B are linearly separable then Equation (2)
is feasible, and the linear program (LP) (3) will have a
zero minimum. The resulting $(w, \gamma)$ forms a decision that
strictly separates A and B. If Equation (2) is not feasible,
then LP (3) minimizes the average misclassification error
within each class.

$\min_{w, \gamma, y, z} \ \frac{1}{m} \sum_{j=1}^{m} y_j + \frac{1}{k} \sum_{i=1}^{k} z_i$
s.t. $y_j \ge A_j w - \gamma + 1, \quad y_j \ge 0, \quad j = 1 \ldots m$
$z_i \ge -B_i w + \gamma + 1, \quad z_i \ge 0, \quad i = 1 \ldots k$   (3)
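
For a fixed candidate decision $(w, \gamma)$ the smallest feasible $y_j$ and $z_i$ in LP (3) are the plus-function values of the constraint right-hand sides, so the objective can be evaluated directly. The sketch below does only that evaluation (it does not solve the LP); the helper names `plus`, `dot` and `lp3_objective` are assumptions introduced for illustration.

```python
def plus(t):
    """Plus function: max{t, 0}."""
    return max(t, 0.0)


def dot(x, w):
    """Inner product of a point x and a weight vector w."""
    return sum(xi * wi for xi, wi in zip(x, w))


def lp3_objective(A, B, w, gamma):
    """Objective value of LP (3) for a fixed decision (w, gamma).

    A, B: lists of points (tuples). Points of A should satisfy
    x.w - gamma <= -1 and points of B should satisfy x.w - gamma >= 1.
    """
    err_a = sum(plus(dot(a, w) - gamma + 1) for a in A) / len(A)
    err_b = sum(plus(-dot(b, w) + gamma + 1) for b in B) / len(B)
    return err_a + err_b
```

A value of zero means that $(w, \gamma)$ satisfies Equation (2), so the plane strictly separates the two sets; the trivial choice $w = 0$, $\gamma = 0$ always gives the value 2.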


LP (3) has been used recursively in a greedy decision
tree algorithm called Multisurface Method-Tree (MSMT)
[1]. While it compares favorably with other greedy decision
tree algorithms, it also suffers the problem of all
greedy approaches. Locally good but globally poor decisions
near the root of the tree can result in overly large
trees with poor generalization. Figure 2 shows an example
of a case where this phenomenon occurs. Figure 2a
depicts the 11 planes used by MSMT to completely classify
all the points. The decisions chosen near the root
of the tree are largely redundant. As a result the decisions
near the leaves of the tree are based on an unnecessarily
small number of points. MSMT constructed an
excessively large tree that does not reflect the underlying
structure of the problem. In contrast, GTO was able to
completely classify all the points using only three decisions
(Figure 2b).

3 Problem Formulation

For general decision trees, the tree can be represented as
a set of disjunctive inequalities. A multilinear program
is used to minimize the error of the disjunctive linear
inequalities. We now consider the problem of optimizing a
tree with the structure given in Figure 1, and then briefly
consider the problem for more general trees.

Recall that a point is strictly classified by the tree in
Figure 1 if the point reaches a leaf of the correct classification
and equality does not hold for any of the decisions
along the path to the leaf. A point $A_j \in A$ is strictly
classified if it follows the path through the tree to the
first or fourth leaf node, i.e. if

$\begin{pmatrix} A_j w^1 - \gamma^1 + 1 \le 0 \\ A_j w^2 - \gamma^2 + 1 \le 0 \end{pmatrix}$
or
$\begin{pmatrix} -A_j w^1 + \gamma^1 + 1 \le 0 \\ A_j w^3 - \gamma^3 + 1 \le 0 \\ -A_j w^4 + \gamma^4 + 1 \le 0 \end{pmatrix}$

or equivalently

$(A_j w^1 - \gamma^1 + 1)_+ + (A_j w^2 - \gamma^2 + 1)_+ = 0$
or
$(-A_j w^1 + \gamma^1 + 1)_+ + (A_j w^3 - \gamma^3 + 1)_+ + (-A_j w^4 + \gamma^4 + 1)_+ = 0$

where $(\zeta)_+ := \max\{\zeta, 0\}$.

Similarly a point $B_i \in B$ is strictly classified if it follows
the path through the tree to the second, third, or fifth leaf
node, i.e. if

$\begin{pmatrix} B_i w^1 - \gamma^1 + 1 \le 0 \\ -B_i w^2 + \gamma^2 + 1 \le 0 \end{pmatrix}$
or
$\begin{pmatrix} -B_i w^1 + \gamma^1 + 1 \le 0 \\ B_i w^3 - \gamma^3 + 1 \le 0 \\ B_i w^4 - \gamma^4 + 1 \le 0 \end{pmatrix}$
or
$\begin{pmatrix} -B_i w^1 + \gamma^1 + 1 \le 0 \\ -B_i w^3 + \gamma^3 + 1 \le 0 \end{pmatrix}$

or equivalently

$(B_i w^1 - \gamma^1 + 1)_+ + (-B_i w^2 + \gamma^2 + 1)_+ = 0$
or
$(-B_i w^1 + \gamma^1 + 1)_+ + (B_i w^3 - \gamma^3 + 1)_+ + (B_i w^4 - \gamma^4 + 1)_+ = 0$
or
$(-B_i w^1 + \gamma^1 + 1)_+ + (-B_i w^3 + \gamma^3 + 1)_+ = 0$

A decision tree exists that strictly classifies all the
points in sets A and B if and only if the following equation
has a feasible solution:

$\sum_{j=1}^{m} (y_{1j} + y_{2j}) \cdot (z_{1j} + y_{3j} + z_{4j}) + \sum_{i=1}^{k} (u_{1i} + v_{2i}) \cdot (v_{1i} + u_{3i} + u_{4i}) \cdot (v_{1i} + v_{3i}) = 0$   (8)

where
$y_{dj} = (A_j w^d - \gamma^d + 1)_+$, $z_{dj} = (-A_j w^d + \gamma^d + 1)_+$, $j = 1 \ldots m$
$u_{di} = (B_i w^d - \gamma^d + 1)_+$, $v_{di} = (-B_i w^d + \gamma^d + 1)_+$, $i = 1 \ldots k$
for $d = 1 \ldots D$
and $D$ = number of decisions in tree.

Furthermore, $(w^d, \gamma^d)$, $d = 1 \ldots D$, satisfying (8) form
the decisions of a tree that strictly classifies all the points
in the sets A and B.

Equivalently,  there  exists  a  decision  tree  with  the  given 
structure  that  correctly  classifies  the  points  in  sets  A  and 
B  if  and  only  if  the  following  multilinear  program  has  a 
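zero minimum:

$$
\begin{aligned}
\min_{w,\gamma,y,z,u,v} \quad & \frac{1}{m} \sum_{j=1}^{m} (y_{1j} + y_{2j}) \cdot (z_{1j} + y_{3j} + z_{4j})
+ \frac{1}{k} \sum_{i=1}^{k} (u_{1i} + v_{2i}) \cdot (v_{1i} + u_{3i} + u_{4i}) \cdot (v_{1i} + v_{3i}) \\
\text{s.t.} \quad & y_{dj} \ge A_j w^d - \gamma^d + 1, \quad z_{dj} \ge -A_j w^d + \gamma^d + 1, \quad j = 1, \dots, m \\
& u_{di} \ge B_i w^d - \gamma^d + 1, \quad v_{di} \ge -B_i w^d + \gamma^d + 1, \quad i = 1, \dots, k \\
& \text{for } d = 1, \dots, D \\
& y, z, u, v \ge 0
\end{aligned} \tag{9}
$$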

The coefficients $\frac{1}{m}$ and $\frac{1}{k}$ were chosen so that (9) is
identical to the LP (3) for the single decision case, thus
guaranteeing that $w = 0$ is never the unique solution
for that case [3]. These coefficients also help to make
the method more numerically stable for large training set
sizes.

This  general  approach  is  applicable  to  any  multivariate 
binary  decision  tree  used  to  classify  two  or  more  sets. 
There  is  an  error  term  for  each  point  in  the  training  set. 
The  error  for  that  point  is  the  product  of  the  errors  at 
each  of  the  leaves.  The  error  at  each  leaf  is  the  sum  of 
the  errors  in  the  decisions  along  the  path  to  that  leaf.  If  a 
point  is  correctly  classified  at  one  leaf,  the  error  along  the 
path  will  be  zero,  and  the  product  of  the  leaf  errors  will 
be  zero.  Space  does  not  permit  discussion  of  the  general 
formulation  in  this  paper,  thus  we  refer  the  reader  to  [2] 
for  more  details. 
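
The error of a single training point can be sketched in Python as follows. This is only an illustration under stated assumptions: the tree is encoded as a hypothetical list of root-to-leaf paths, each a list of (decision index, sign) pairs, and the names `plus`, `decision_error`, `point_error`, and the one-dimensional example decisions are invented for the example rather than taken from the paper.

```python
def plus(t):
    """Plus function (t)_+ = max(t, 0)."""
    return max(t, 0.0)


def decision_error(x, w, gamma, sign):
    """Error of point x at one decision.

    sign = +1 gives (x.w - gamma + 1)_+ and sign = -1 gives
    (-x.w + gamma + 1)_+, the two error forms used in the text.
    """
    xw = sum(xi * wi for xi, wi in zip(x, w))
    return plus(sign * (xw - gamma) + 1.0)


def point_error(x, paths, decisions):
    """Product over the correct leaves of the summed path errors."""
    err = 1.0
    for path in paths:
        err *= sum(decision_error(x, *decisions[d], sign) for d, sign in path)
    return err


# Hypothetical one-dimensional tree with four decisions (w, gamma).
decisions = [([1.0], 0.0), ([1.0], -5.0), ([1.0], 10.0), ([1.0], 5.0)]
# Paths to the two leaves assumed correct for a point of set A
# (decision indices are 0-based).
paths_A = [[(0, +1), (1, +1)], [(0, -1), (2, +1), (3, -1)]]

err_first_leaf = point_error([-7.0], paths_A, decisions)   # reaches a correct leaf
err_fourth_leaf = point_error([7.0], paths_A, decisions)   # reaches the other correct leaf
err_wrong_leaf = point_error([-3.0], paths_A, decisions)   # misses both
```

A point that is strictly classified at either correct leaf contributes zero, because one factor of the product vanishes; a point that misses every correct leaf contributes a positive product.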

4  Multilinear  Programming 

The multilinear program (3) and its more general formulation
can be optimized using the iterative linear programming
Frank-Wolfe type method proposed in [4]. We
outline the method here, and refer the reader to [2] for
the mathematical properties of the algorithm.

Consider the problem $\min f(x)$ subject to $x \in X$, where
$f : R^n \to R$, $X$ is a polyhedral set in $R^n$ containing the
constraint $x \ge 0$, $f$ has continuous first partial derivatives,
and $f$ is bounded below. The Frank-Wolfe algorithm
for this problem is the following:

Algorithm 4.1 (Frank-Wolfe algorithm [7, 4])

Start with any $x^0 \in X$. Compute $x^{i+1}$ from $x^i$ as follows.

(i) $v^i \in \arg\operatorname{vertex}\min_{x \in X} \nabla f(x^i)\, x$

(ii) Stop if $\nabla f(x^i)\, v^i = \nabla f(x^i)\, x^i$

(iii) $x^{i+1} = (1 - \lambda^i) x^i + \lambda^i v^i$ where
$\lambda^i \in \arg\min_{0 \le \lambda \le 1} f((1 - \lambda) x^i + \lambda v^i)$

In the above algorithm "arg vertex min" denotes
a vertex solution set of the indicated linear program.
The algorithm terminates at some $x^j$ that satisfies
the minimum principle necessary optimality condition:
$\nabla f(x^j)(x - x^j) \ge 0$ for all $x \in X$, or each accumulation
point $\bar{x}$ of the sequence $\{x^i\}$ satisfies the minimum
principle [4].
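
The three steps of Algorithm 4.1 can be sketched in Python as follows. This is a minimal illustration, not the implementation used for GTO. It assumes the feasible set is the unit box, so that the vertex LP of step (i) has a closed-form solution, and it replaces the exact minimization of step (iii) with a ternary search, which is exact only when $f$ is unimodal along the segment. The function names and the quadratic test objective are hypothetical.

```python
def frank_wolfe(f, grad, x0, vertex_min, max_iter=500, tol=1e-10):
    """Frank-Wolfe iteration in the style of Algorithm 4.1.

    vertex_min(g) must return a vertex of the feasible polyhedron
    that minimizes the linear function g . x (step (i)).
    """
    x = list(x0)
    for _iteration in range(max_iter):
        g = grad(x)
        v = vertex_min(g)
        # Step (ii): stop when the linearized objective cannot decrease.
        gap = sum(gk * (xk - vk) for gk, xk, vk in zip(g, x, v))
        if gap <= tol:
            break
        # Step (iii): line search over the segment from x to v.
        lo, hi = 0.0, 1.0
        for _step in range(100):
            a = lo + (hi - lo) / 3
            b = hi - (hi - lo) / 3
            fa = f([xk + a * (vk - xk) for xk, vk in zip(x, v)])
            fb = f([xk + b * (vk - xk) for xk, vk in zip(x, v)])
            if fa < fb:
                hi = b
            else:
                lo = a
        lam = (lo + hi) / 2
        x = [xk + lam * (vk - xk) for xk, vk in zip(x, v)]
    return x


def f(x):
    # Hypothetical smooth objective with its minimizer inside the box.
    return (x[0] - 0.3) ** 2 + (x[1] - 0.8) ** 2


def grad(x):
    return [2 * (x[0] - 0.3), 2 * (x[1] - 0.8)]


def box_vertex(g):
    # Vertex of [0, 1]^n minimizing g . x: coordinate 0 where g >= 0, else 1.
    return [0.0 if gk >= 0 else 1.0 for gk in g]


x_star = frank_wolfe(f, grad, [0.0, 0.0], box_vertex)
```

For the multilinear programs of this paper, `vertex_min` would instead be a call to an LP solver over the constraints of Problem (9).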

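The gradient calculation for the GTO function is
straightforward. For example, when Algorithm 4.1 is applied
to Problem (9), the following linear subproblem is
solved in step (i) with $(\bar{w}, \bar{\gamma}, \bar{y}, \bar{z}, \bar{u}, \bar{v}) = x^i$:

$$
\begin{aligned}
\min_{w,\gamma,y,z,u,v} \quad
& \frac{1}{m} \sum_{j=1}^{m} (\bar{y}_{1j} + \bar{y}_{2j}) \cdot (z_{1j} + y_{3j} + z_{4j})
+ \frac{1}{m} \sum_{j=1}^{m} (y_{1j} + y_{2j}) \cdot (\bar{z}_{1j} + \bar{y}_{3j} + \bar{z}_{4j}) \\
& + \frac{1}{k} \sum_{i=1}^{k} (\bar{u}_{1i} + \bar{v}_{2i}) \cdot (\bar{v}_{1i} + \bar{u}_{3i} + \bar{u}_{4i}) \cdot (v_{1i} + v_{3i}) \\
& + \frac{1}{k} \sum_{i=1}^{k} (\bar{u}_{1i} + \bar{v}_{2i}) \cdot (v_{1i} + u_{3i} + u_{4i}) \cdot (\bar{v}_{1i} + \bar{v}_{3i}) \\
& + \frac{1}{k} \sum_{i=1}^{k} (u_{1i} + v_{2i}) \cdot (\bar{v}_{1i} + \bar{u}_{3i} + \bar{u}_{4i}) \cdot (\bar{v}_{1i} + \bar{v}_{3i}) \\
\text{s.t.} \quad
& y_{dj} \ge A_j w^d - \gamma^d + 1, \quad z_{dj} \ge -A_j w^d + \gamma^d + 1, \quad j = 1, \dots, m \\
& u_{di} \ge B_i w^d - \gamma^d + 1, \quad v_{di} \ge -B_i w^d + \gamma^d + 1, \quad i = 1, \dots, k \\
& \text{for } d = 1, \dots, D \\
& y, z, u, v \ge 0, \quad \text{fixed } \bar{y}, \bar{z}, \bar{u}, \bar{v} \ge 0
\end{aligned}
$$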

5  Results  and  Conclusions 

GTO  was  implemented  for  general  decision  trees  with 
fixed  structure.  In  order  to  test  the  effectiveness  of  the 
optimization  algorithm,  random  problems  with  known  so¬ 
lutions  were  generated.  For  a  given  dimension,  a  tree 
with  3  to  7  decision  nodes  was  randomly  generated  to 
classify  points  in  the  unit  cube.  Points  in  the  unit  cube 
were  randomly  generated  and  classified  and  grouped  into 
a  training  set  (500  to  1000  points)  and  a  testing  set  (5000 
points).  MSMT,  the  greedy  algorithm  discussed  in  Sec¬ 
tion  2,  was  used  to  generate  a  greedy  tree  that  correctly 
classified  the  training  set.  The  MSMT  tree  was  then 
pruned  to  the  known  structure  (i.e.  the  number  of  de¬ 
cision  nodes)  of  the  tree.  The  pruned  tree  was  used  as  a 
starting  point  for  GTO.  The  training  and  testing  set  error 
of the MSMT tree, the pruned tree (denoted MSMT-P),
and  the  GTO  tree  were  measured,  as  was  the  training 
time.  This  experiment  was  repeated  for  trees  ranging 
from  3  to  7  nodes  in  2  to  25  dimensions.  The  results  were 
averaged  over  10  trials. 

We  summarize  the  test  results  and  refer  the  reader  to 
[2]  for  more  details.  Figure  3  presents  the  average  results 
for  randomly  generated  trees  with  three  decision  nodes. 
These  results  are  typical  of  those  observed  in  the  other 
experiments.  MSMT  achieved  100%  correctness  on  the 
training  set  but  used  an  excessive  number  of  decisions. 
The  training  and  testing  set  accuracy  of  the  pruned  trees 
dropped  considerably.  The  trees  once  optimized  by  GTO 
were  significantly  better  in  terms  of  testing  set  accuracy 
than  both  unpruned  and  pruned  MSMT  trees. 

The computational results are promising. The Frank-Wolfe
algorithm converges in relatively few iterations to
an improved solution. However, GTO did not always find
the  global  minimum.  We  expect  the  problem  to  have 
many  local  minima  since  it  is  NP-complete.  We  plan  to 
investigate  using  global  optimization  techniques  to  avoid 
local  minima.  The  overall  execution  time  of  GTO  tends 
to  grow  as  the  problem  size  increases.  Parallel  compu¬ 
tation  can  be  used  to  improve  the  execution  time  of  the 
expensive  LP  subproblems.  The  LP  subproblems  (e.g. 
Problem  (9))  have  a  block-separable  structure  and  can 
be  divided  into  independent  LPs  solvable  in  parallel. 

We  have  introduced  a  non-greedy  approach  for  opti¬ 
mizing  decision  trees.  The  GTO  algorithm  starts  with  an 
existing  decision  tree,  fixes  the  structure  of  the  tree,  for¬ 
mulates  the  error  of  the  tree,  and  then  optimizes  that  er¬ 
ror.  An  iterative  linear  programming  algorithm  performs 
well  on  this  NP-complete  problem.  GTO  optimizes  all 
the  decisions  in  the  tree,  and  thus  has  many  potential  ap¬ 
plications  such  as:  decreasing  greediness  of  constructive 
algorithms, reoptimizing existing trees when additional
data  is  available,  pruning  greedy  decision  trees,  and  in¬ 
corporating  domain  knowledge  into  the  decision  tree. 
Abstract¹

Most  algorithms  which  induce  model  structure  from 
sample  data  proceed,  to  varying  degrees,  "greedily".  That  is, 
they  sequentially  add  to  the  current  model  the  candidate 
component  which  works  best  with  the  existing  structure. 
(Such components include a linear term with stepwise
regression, a small polynomial with GMDH-like methods²,
or a threshold split with decision trees.)

This  greedy  search  procedure  is  relatively  fast,  but  is  not 
optimal,  as  there  can  exist  models  within  the  "reachable" 
space  which  have  less  complexity  and/or  greater  accuracy  on 
the  training  data.  Indeed,  this  difference  in  training  perfor¬ 
mance  between  optimal  and  greedy  models  can  be  surpris¬ 
ingly  large.  Still,  it  is  not  clear  how  much  greediness  hurts 
in practice, and whether greedy models typically underperform
on unseen but similar data.

Here,  we  review  example  effects  of  greediness  in  regres¬ 
sion  to  motivate  study  of  the  issue  with  another  popular 
model  form:  decision  trees.  A  new  tree  algorithm,  "Texas 
Two-Step",  is  introduced  which  looks  ahead  one  more  gener¬ 
ation  than  standard  procedures.  In  other  words,  it  judges  a 
potential  split  not  by  how  the  resulting  child  nodes  turn  out, 
but  by  how  the  grandchildren  do.  Preliminary  results  are 
compared  on  a  recent  field  application:  identifying  a  bat's 
species  by  its  chirps. 

1. Automated Induction

Inductive  algorithms  are,  at  one  level,  "black  boxes"  for 
developing  classification,  estimation,  or  control  models 
from  sample  data.  They  automatically  search  a  vast  space  of 
potential  models  for  the  best  inputs,  structure  (terms  and 
interconnections),  and  parameter  values.  The  models  are 
pieced  together  in  a  stepwise  manner  into  a  feed-forward 
network  (e.g.,  tree)  of  simple  nodes.  The  better  methods 
also  prune  unnecessary  terms  or  nodes  from  the  model, 
thereby  regulating  complexity  to  reduce  the  chance  of 
overfit. Overfit models are over-specialized to the training
data and generalize poorly (fail on new data). This is widely
held to be the chief danger of using inductive methods.

¹This work was partially supported by an NSF Research
Associateship in Computational Science and Engineering.

²Group Method of Data-Handling (Ivakhnenko, 1968). See also
the book edited by Farlow, 1984.

Complexity  is  regulated  either  through 

1) term penalties, as with model selection criteria such
as  Cp  (Mallows,  1973)  and  Minimum  Description 
Length,  MDL  (Rissanen,  1978), 

2)  roughness  penalties  (integrated  second  derivatives  of 
the  estimation  surface),  or 

3)  tests  on  withheld  data  (e.g.,  V-fold  cross-validation). 

The  penalties  add  to  an  error  measure,  and  models  having  the 
lowest  combined  score  are  judged  the  best  candidates  for  use. 

Stepwise  regression  can  be  considered  a  low-level  auto¬ 
mated  induction  algorithm.  Though  the  set  of  possible 
models  (linear  combinations  of  a  subset  of  original  candidate 
inputs)  is  quite  constrained,  the  procedure  does  identify 
which variables to employ and can increase or reduce the size
of  the  set  under  consideration. 

In contrast, Artificial Neural Networks (ANNs) are not
inductive methods by the definition used here, as their structure
is fixed a priori.³ They can more precisely be viewed
as a class of nonlinear models whose parameters are typically
set through a local gradient search called back-propagation.⁴
(One suspects that ANNs, which can perform well
even when they appear over-parameterized, may avoid overfit
partly because of the weakness of this search algorithm! It
is possible that improvement of the search procedure
without simplification of the model structure may result in
better training but worse out-of-sample performance.)⁵

Leading automated induction methods (using "building
blocks" consisting of logistic functions, splines, polynomials,
planes, non-parametric smoothes of weighted sums, etc.)
are briefly described in (Elder, 1993) along with their chief
strengths and weaknesses. Here, we focus on one of the
latter: greediness, and look briefly at its effect on regression
and decision trees.

³Removing small terms within ANN nodes does not address
over-parameterization, where useless terms can appear
significant though their coefficients collectively cancel. (The
dangers of collinear variables in regression are analogous.)

⁴This iterative search converges relatively slowly to a local
minimum in parameter space, and it has recently been shown
(Mulier and Cherkassky, 1993) that the presentation order of
the data affects the particular minimum found.

⁵If this danger is real, then the "greedy" nature of the gradient
search may have benefits as well.

Figure 1: Greedy vs. Optimal Subset Selection (horizontal axis: # Terms in Model)

2.  Subset  Selection  in  Regression 

Due  to  the  combinatorial  explosion  of  a  trial-and-error 
search  process  (the  methods  are  at  least  polynomial  in  the 
inputs  and  often  exponential),  a  greedy  heuristic  is  often 
employed:  models  are  constructed  in  stages,  and  only  the 
current  step  is  optimized  at  a  given  time.  Forward  selection 
finds  the  single  best  term,  then  adds  to  it  the  term  which 
works  best  with  the  first,  then  the  one  which  best  assists  the 
pair,  and  so  on.  (Note  that  this  is  very  much  more  useful 
than  a  "first  impression"  model,  which  ranks  the  candidate 
terms  according  to  their  individual  performance  and  employs 
the top K.) Reverse elimination begins with a "full" model
and  sequentially  removes  the  least  useful  term. 

A  combined  method,  stepwise  selection  (e.g.,  Draper 
and  Smith,  1966)  considers  removing  variables  after  each 
new  variable  is  introduced.  The  standard  selection  mecha¬ 
nism,  checking  "F-to-enter"  and  "F-to-exit"  significance 
values,  is  a  kind  of  heuristic  term  penalty  method,  but  not  a 
correct  use  of  F-tests.  (The  static  significance  measure  is 
invalid  in  the  dynamic  modeling  situation  and  can  lead  to 
highly  inflated  confidences  in  the  resulting  parameter  values; 
see, e.g., Miller, 1990.)

This  greedy  growth  strategy  makes  the  search  feasible 
and  often  discovers  useful  features,  but  can  miss  "reachable" 
structure  in  the  data;  that  is,  within  the  form  of  the  basis 
functions employed. For example, given Y = {1,1,1,1},
x1 = {1,1,1,0}, x2 = {1,1,0,0}, x3 = {0,0,1,1}, a stepwise procedure
would first choose x1 with which to estimate Y, and
then seek to add another x. However, an exact model, Y =
x2 + x3, would not include that single best input.
Surprisingly,  even  if  there  is  agreement  between  the  forward 
and  backward  procedures  on  the  best  model  of  each  size,  they 
can  differ  by  an  arbitrarily  large  amount  from  some  of  the 
best  subsets  (Berk,  1978). 
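
The four-case example above can be checked with a short Python sketch. It is an illustration only: it assumes least-squares fits without an intercept term (with an intercept, the constant Y would be fit trivially), and the helper names `solve`, `sse`, `forward_selection`, and `best_subset` are invented here.

```python
from itertools import combinations


def solve(a, b):
    """Solve a small linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [list(row) + [bi] for row, bi in zip(a, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(n):
            if r != c:
                k = m[r][c] / m[c][c]
                m[r] = [mr - k * mc for mr, mc in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]


def sse(y, cols):
    """Sum of squared errors of the least-squares fit of y on cols."""
    gram = [[sum(p * q for p, q in zip(ci, cj)) for cj in cols] for ci in cols]
    rhs = [sum(p * q for p, q in zip(ci, y)) for ci in cols]
    beta = solve(gram, rhs)
    fit = [sum(b * c[t] for b, c in zip(beta, cols)) for t in range(len(y))]
    return sum((yt - ft) ** 2 for yt, ft in zip(y, fit))


def forward_selection(y, xs, size):
    """Greedy: repeatedly add the input that most reduces the SSE."""
    chosen = []
    while len(chosen) < size:
        rest = [i for i in range(len(xs)) if i not in chosen]
        chosen.append(min(rest, key=lambda i: sse(y, [xs[j] for j in chosen + [i]])))
    return chosen


def best_subset(y, xs, size):
    """Exhaustive: try every subset of the given size."""
    return min(combinations(range(len(xs)), size),
               key=lambda s: sse(y, [xs[j] for j in s]))


y = [1, 1, 1, 1]
xs = [[1, 1, 1, 0], [1, 1, 0, 0], [0, 0, 1, 1]]  # x1, x2, x3
greedy = forward_selection(y, xs, 2)
optimal = best_subset(y, xs, 2)
greedy_sse = sse(y, [xs[j] for j in greedy])
optimal_sse = sse(y, [xs[j] for j in optimal])
```

Under these assumptions forward selection keeps x1 and adds x3, leaving a residual error, while the exhaustive search finds the exact pair x2 and x3.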


For  example,  Desroachers  and  Mohseni  (1984)  presented 
a  purportedly  optimal  algorithm  for  model  selection,  and 
demonstrated  it  on  a  problem  of  estimating  rocket  engine 
temperature  (from  Lloyd  and  Lipow,  1962),  where  their 
small  set  results  agreed  with  earlier  analyses  by  Draper  and 
Smith  (1966).  However,  the  approach  turned  out  to  be  a 
version  of  forward  selection.  To  compare  these  models  with 
optimal  subsets  (of  the  candidate  set  defined  by  Desroachers 
and  Mohseni),  a  new  technique  for  term  elimination  had  to 
be  developed  (Elder,  1990).  Figure  1  shows  the  SSE  of  the 
greedy  and  optimal  models  of  each  size.  The  former  leveled 
off  at  a  limit  of  40,  while  the  latter  were  able  to  reach  nearly 
the  minimum  error  possible  for  the  data  (approximated  by 
the  Y  axis  base).  Clearly,  greedy  methods  can  be  improved 
upon  significantly,  in  training,  on  real  applications. 

For  regression  model  building,  a  logical  extension  of 
the  greedy  growth  strategy  (while  stopping  short  of  the  hope 
of  "optimal"  models)  is  to  add  chunks  of  terms  at  a  time, 
rather than just one. This is the heart of the approach taken in
GMDH-like  techniques,  such  as  ASPN  (Algorithm  for  the 
Synthesis  of  Polynomial  Networks,  Elder,  1985).  There, 
sets  of  several  terms,  employing  a  few  independent 
variables,  are  considered  for  inclusion  simultaneously,  then 
pared  down  by  reverse  elimination.  Nodes  of  such  equations 
are  built  up  until  the  added  complexity  cannot  be  justified, 
according  to  a  penalty  criterion  —  either  Predicted  Squared 
Error  (A.  Barron,  1984)  or  MDL.  An  ASPN  regression 
network,  such  as  that  shown  in  Figure  2,  can  have  multiple 
layers  of  diverse  nodes,  each  with  several  terms,  resulting  in 
a  flexible  compound  function  form. 

Extensive  comparison  with  more  greedy  algorithms  has 
yet  to  be  performed,  but  several  researchers  have  successfully 
employed  such  regression  networks  on  applications  which 
had  proven  very  difficult  by  other  methods,  including  auto¬ 
matic  pipe  inspection  (Mucciardi,  1982),  fish  stock  classifi¬ 
cation  (Prager,  1988),  reconfigurable  flight  control  (Elder 
and  Barron,  1988),  tactical  weapon  guidance  (Barron  and 
Abbott,  1988),  and  temperature  distribution  forecasting 
(Fulcher and Brown, 1991). Though several areas of possible
improvement have been identified (Elder and Brown,
1994), its success suggests that taking complex, rather than
simple, steps might improve other constructive algorithms
for induction, such as those used to build decision trees.

Figure 2: Sample Regression Network

3.  Constructing  Decision  Trees 


Though  there  are  other  and  earlier  decision  tree  algo¬ 
rithms  (e.g.,  ID3  and  CHAID),  CART  (Classification  and 
Regression  Trees,  Breiman,  Friedman,  Olshen  and  Stone, 
1984)  is  perhaps  the  best  known  and,  arguably,  most  power¬ 
ful.  Some  of  its  nicer  features  include  built-in  cross-valida¬ 
tion,  the  ability  to  handle  categorical  variables  and  missing 
data,  and  a  good  presentation  of  the  output.  (Versions  are 
also  appearing  which  tie  into  commercial  statistical  pack¬ 
ages  and  improve  the  interface.)  Still,  the  basic  classifica¬ 
tion  algorithm  is  very  simple:  try  to  discriminate  between 
classes  by  recursively  bifurcating  the  data  until  the  resulting 
groups  are  as  pure  as  can  be  sustained.  That  is,  start  with 
all  the  training  data  and  choose  the  univariate  threshold  split 
(e.g.,  x3  <  1.14)  which  divides  the  sample  into  two  maxi¬ 
mally  pure  parts  (i.e.,  minimizes  the  sample  variance  of  the 
sum). (Multi-linear splits (e.g., x1 + 2x2 < 3) are possible,
but  do  not  seem  to  work  well  in  practice,  perhaps  because  of 
a  poor  internal  search  algorithm.)  Then,  continue  with  each 
of  the  parts  (child  nodes)  until  either  no  splits  are  possible, 
or  the  leaves  (terminal  nodes  of  the  tree)  are  pure  (represent 
only  one  class)  or  have  some  minimum  size.  Then,  CART 
prunes  back  (simplifies)  the  tree,  typically  using  cross-vali¬ 
dation,  to  avoid  overfit.  This  over-training  followed  by 
pruning  was  found  by  CART's  authors  to  lead  to  better  trees 
than  under  the  competing  method  of  trying  to  select  the 
growth  stopping  point. 
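
The greedy choice of a single univariate threshold split can be sketched as follows. This is a simplified illustration, not CART itself: it assumes the Gini index as the purity measure and midpoints between consecutive distinct values as candidate thresholds, and it omits pruning, missing-data handling, and cross-validation. The function names and the toy data are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0 for a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Return (score, feature, threshold) of the purest split x[feature] < threshold."""
    best = None
    n = len(rows)
    for d in range(len(rows[0])):
        values = sorted(set(r[d] for r in rows))
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[d] < t]
            right = [y for r, y in zip(rows, labels) if r[d] >= t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, d, t)
    return best


# Toy data: the first variable separates the classes, the second does not.
rows = [[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]]
labels = ["a", "a", "b", "b"]
split = best_split(rows, labels)
```

Growing a tree then amounts to applying `best_split` recursively to each child node until a stopping condition holds.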

For  estimation,  the  leaves  are  set  to  the  mean  or  median 
value  of  the  cases  contained,  forming  a  piecewise-constant 
surface,  as  shown  in  Figure  3  for  a  4-node  tree. 


This  simple  splitting  approach  is  nevertheless  powerful, 
as  a  sequence  of  threshold  questions  quickly  conditions  an 
individual  case.  Each  path  down  the  tree  can  have  its  own 
important  variables  and  outliers  have  no  special  influence. 
Also, as with other methods which implicitly select
variables, a user can feel free to try more candidates than
otherwise, since CART will sift through them unfettered by
concerns about multicollinearity, which can hurt regression
methods. However, if the candidate variables are jointly useful,
relatively independent, and not beset by many outliers,
other methods of discrimination can outperform CART.

Table 1: Greedy Counter-Example for CART

Here,  we  wonder  simply  if  CART's  strategy  of  choos¬ 
ing  the  greedy  split  cannot  be  improved.  As  a  motivating 
example,  consider  the  XOR-like  data  of  Table  1.  CART 
forms  the  approximation  tree  of  Figure  4a  (using  a  leaf  size 
limit  of  <  2  cases).  Its  greedy  search  does  not  find  the 
simpler,  exact  tree  of  Figure  4b. 

To  explore  whether  an  extension  of  the  horizon  to  two 
steps  ahead  would  be  beneficial,  a  decision  tree  algorithm 
called  "Texas  Two-Step"  was  written. 
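
The difference between judging a split by its children and by its grandchildren can be sketched as follows. This is not the TX2step code: it assumes a misclassification count as the score, and it uses a hypothetical four-point XOR data set rather than the data of Table 1. All function names are invented for the illustration.

```python
from collections import Counter


def errors(labels):
    """Number of cases outside the majority class of a node."""
    if not labels:
        return 0
    return len(labels) - max(Counter(labels).values())


def splits(rows):
    """Candidate (feature, threshold) pairs: midpoints of distinct values."""
    for d in range(len(rows[0])):
        vals = sorted(set(r[d] for r in rows))
        for lo, hi in zip(vals, vals[1:]):
            yield d, (lo + hi) / 2


def partition(rows, labels, d, t):
    left = [(r, y) for r, y in zip(rows, labels) if r[d] < t]
    right = [(r, y) for r, y in zip(rows, labels) if r[d] >= t]
    return left, right


def score(rows, labels, depth):
    """Fewest errors reachable with `depth` further levels of splits."""
    best = errors(labels)
    if depth == 0:
        return best
    for d, t in splits(rows):
        total = sum(score([r for r, _ in part], [y for _, y in part], depth - 1)
                    for part in partition(rows, labels, d, t))
        best = min(best, total)
    return best


def choose_split(rows, labels, lookahead):
    """Pick the split judged `lookahead` levels down (1 = greedy)."""
    def judged(s):
        return sum(score([r for r, _ in part], [y for _, y in part], lookahead - 1)
                   for part in partition(rows, labels, *s))
    best = min(splits(rows), key=judged)
    return best, judged(best)


# Hypothetical XOR data: the class is 1 exactly when the inputs differ.
rows = [[0, 0], [0, 1], [1, 0], [1, 1]]
labels = [0, 1, 1, 0]
greedy_choice = choose_split(rows, labels, 1)
two_step_choice = choose_split(rows, labels, 2)
```

On this data no single split reduces the error count, so the greedy score stays at 2, while the two-step score recognizes that either root split leads to a perfect tree one level further down.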


Figure 3: Example Decision Tree Surface


Figure  4b:  Correct  Decision  Tree 
4. Texas Two-Step (TX2step)


The  algorithm  TX2step  is  a  slimmed-down  version  of 
CART  for  classification  which  is  not  able  to  handle  missing 
data,  perform  internal  cross-validation,  set  misclassification 
costs,  or  adjust  priors,  and  so  on.  Yet  it  can  look  two  steps 
ahead  to  choose  the  current  split,  and  thereby  finds  the  tree 
of  Figure  4b  given  the  data  of  Table  1.  TX2step  has  one 
other new feature: given more than one split which results
in  the  same  score,  it  uses  the  split  with  the  largest  relative 
gap  between  border  training  cases.  That  is,  the  tie-breaker  to 
choose  the  dimension  d  of  the  split  depends  on 


$$
\frac{\min_{\text{Right}} x_d - \max_{\text{Left}} x_d}{\max x_d - \min x_d}
$$
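
As a small illustration, the relative gap of one candidate split might be computed as below. The function name and the sample values are hypothetical, and the sketch assumes cases with a value below the threshold go left.

```python
def relative_gap(values, threshold):
    """Gap between the border cases of a split, relative to the variable's range."""
    left = [v for v in values if v < threshold]
    right = [v for v in values if v >= threshold]
    return (min(right) - max(left)) / (max(values) - min(values))


# Border cases are 2.0 (left) and 6.0 (right); the range is 9.0 - 1.0.
gap = relative_gap([1.0, 2.0, 6.0, 9.0], 4.0)
```

Among splits with equal scores, the one with the largest such value would be preferred.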


The  algorithm  can  optionally  be  greedy  as  well;  in  that 
mode,  and  ignoring  gaps,  it  was  validated  on  several  test 
problems  to  reproduce  the  same  tree  as  CART  without 
cross-validation. Therefore, to focus solely on the greediness
issue, TX2step-1 (with gap measurement) was actually
run  in  place  of  CART  on  the  example  application  shown 
next.  Training  was  performed  until  all  nodes  were  pure,  but 
those  leaves  with  a  majority  class  having  <3  cases  were 
pruned  back  (i.e.,  re-absorbed  into  their  parent  node). 

5. Example: Identifying Bat Species

Researchers from the University of Illinois, Urbana/Champaign⁶
have measured bat echolocation calls and
extracted time-frequency features from the signals, toward
developing an automated classifying system to track species
of bats, especially those considered endangered. After
visualization of projections of the data by the author, and
analysis of correlations, multicollinearity, redundancy, and
outliers (for suggested techniques see, e.g., Elder, 1993), some
variables were eliminated and other new ones tried at UIUC,
resulting in a database of 93 cases, each with 15 candidate
input features, representing 5 different species (classes) of
bats.⁷,⁸ One of the better projections of the data is shown
in Figure 5, where the classes are noted by different symbols.
Note that the groups do tend to cluster but that a fair
amount of overlap is evident in this (and all low-d) views.

Trained on all the data, the 1-step tree, shown in Figure
6, had 5 splits (17 prior to pruning) and made 13 training
errors. (In the trees, "Yes" answers travel to the left child;
"No" to the right.) The 2-step tree of Figure 7 started out
simpler, with 14 splits, but pruned less, ending with 10
splits and only 5 training errors. The best root node split
happened to be greedy but several other splits were not. For
example, the data in the right child node of the root, shown
in Figures 8 and 9, are those 58 of 93 cases where x5 >
101.5. The greedy tree was drawn to split first on x20
< 3.59, then on x4 < 44.5, and it missed 6 cases on that
branch. The 2-step tree instead first chose x11 < 0.39, a
seemingly worse split, but one which, when followed by
x4 < 43.5 on one branch, allowed it to correctly classify 4
more cases. (The difficulties the split caused its sibling
branch were cleared up by subsequent splits.) The 2-step
cuts were often more appealing visually; that is, they
accorded more with what an analyst would do when viewing
two dimensions of data simultaneously, rather than one.

⁶Biologists Ken White, Curtis Condon, and Al Feng, and
Electrical Engineers Oliver Kaefer and Doug Jones.

⁷A single bat from a sixth "Long-Eared" species contributed 5
signals originally, but was removed since it could be easily
distinguished by its low-frequency signals and since having
only one representative did not allow proper evaluation testing.

⁸It takes less than a second on a SPARC-2 to run the 1-step
algorithm on this problem, but about 75 seconds for 2 steps.

Figure 5: Example Projection of Bat Classes

Figure 6: CART (1-step) Tree (using all data)

Figure 7: TX2step Tree (using all data)

As  expected,  the  less  greedy  algorithm  performed  better 
on  training  data.  The  best  test,  of  course,  involves  new 
data.  Since  there  were  not  many  cases,  a  cross-validation 
evaluation  was  performed,  where  all  3-8  signals  for  each  bat, 
in  turn,  were  held  out  of  training  and  independently  run 
down  the  tree  for  testing  (18  runs  for  each  method).  Tables 
2-4  show  the  resulting  confusion  matrices  for  CART, 
TX2step, and a neural network (courtesy of Oliver Kaefer
and  Doug  Jones  of  UIUC)  trained  on  the  variables  selected 
by  the  two  tree  methods.  Correct  classifications  are  along 
the  diagonal  and  the  hit  percentage  is  shown  in  the  corner. 

CART  gets  43  of  93  signals  correct  (46%),  TX2step  54 
(58%),  and  the  ANN  performs  best  with  64  (69%).  The 
difference  in  accuracy  for  the  tree  methods  appears  more 
critical  when  using  a  voting  scheme,  where  several  different 
signals  from  a  single  bat  are  classified  and  the  majority  class 
is  assigned.  Then,  CART  misses  11  of  the  18  bats  but 
TX2step  only  6.  (The  voting  ANN  misses  just  4.) 
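
The voting scheme can be sketched in a few lines. The function name and the species labels are hypothetical; ties are broken here in favor of the class seen first, which is an assumption and not a rule stated in the text.

```python
from collections import Counter


def vote(predictions):
    """Majority class among the per-signal predictions for one bat."""
    return Counter(predictions).most_common(1)[0][0]


# Three signals from one bat, two of them assigned to species "A".
bat_class = vote(["A", "B", "A"])
```

A bat is counted as missed when the class returned by `vote` differs from its true species.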

In  this  experiment  (counter  to  our  usual  experience),  the 
tree  methods  were  outperformed  by  an  ANN.  However,  the 
variable  selection  performed  by  CART  and  TX2step  proved 
helpful  to  the  ANN;  one  trained  on  all  35  original  data 
features  got  only  52%  correct  in  bat-wise  cross-validation, 
and  one  trained  on  17  variables  (those  given  as  candidates  to 
the  tree  methods)  was  63%  correct.  Here,  simpler  ANNs 
performed  better  on  new  data.  Clearly,  an  inductive  ANN 
algorithm,  which  adapts  the  network  structure  to  the  data, 
would be a useful tool. The data characteristics (filtered
features, lack of outliers, clustered classes) which helped
the neural network perform well should also be agreeable to
exemplar-based statistical techniques, such as kernels and
nearest neighbors. (We hope to soon try them, as well as
regression networks and other inductive methods.)

Figure 8: CART View at Right Node

Figure 9: TX2step View at Right Node

Table 2: CART

Table 3: TX2step

Table 4: 8-Input Neural Network
6. Performance on New Data: Remarks

We  have  seen  that  regression  subsets  and  decision 
trees  can  be  sub-optimal  if  the  single  best  step  is  always 
taken.  This  is  true  in  other  venues  as  well.  Cover  (1974) 
showed  an  investigation  in  which  greed  hurts,  where:  If 
only one experiment is allowed, E1 provides the most
information,  but  if  two  are  possible,  then  independent 
versions  of  the  "worse"  experiment  E2  are  better. 


But  the  degree  to  which  greediness  generally  hurts 
performance  in  practice,  on  new  data,  is  an  open  question. 
Berk  (1978)  sounded  a  slightly  cautionary  note  in  the  case 
of  regression  subset  selection.  Using  nine  well-studied 
data  sets  (having  from  4  to  15  predictors,  13  to  541  cases, 
and  often  more  analysts!),  he  noted  the  maximum  training 
error  difference  between  all-subsets  (optimal)  models  and 
both 1) forward selection and 2) reverse elimination models.
An improvement of up to 29% in SSE was observed.
Then, the sample distributions of each data set were
employed to generate synthetic data with known population
characteristics, and the study was again performed for this
new evaluation data. Figure 10 plots the training vs.
evaluation data differences for the forward and reverse
models from the (Berk, 1978) study. Most evaluation
differences were smaller and in a tighter range (-2 to 7%,
with one exception). In two cases, a greedy method won
on the evaluation data by a slight margin.

Figure 10: Ex. Training vs. Evaluation Improvement of
Optimal over both Forward and Reverse Greedy Methods

Note  that  the  differences  are  somewhat  exaggerated,  as 
the  maximum  disagreement  between  methods  is  shown, 
not  that  at  some  automated  stopping  point.  For  instance, 
the  two  worst  reverse  values  (one  training,  one  evalua¬ 
tion),  are  for  models  of  size  1  and  2  —  where  the  forward 
method  would  clearly  be  preferable.  Still,  the  greedy 
training  and  evaluation  under-performances  are  correlated, 
and  it  can  tentatively  be  concluded  that  regression  differ¬ 
ences  on  new  data,  while  usually  less  dramatic  than  on 
training  data,  are  still  likely  to  be  significant. 

This  was  also  shown  to  be  the  case  for  decision  trees, 
where  a  version  of  CART  was  out-performed  on  an  exam¬ 
ple  problem  by  TX2step,  which  looks  ahead  an  additional 
step  when  selecting  a  threshold  for  the  current  node. 
Further  research  is  planned  to  examine  the  effects  of 
greedy  model  construction  strategies  in  these  and  other 
inductive  methods,  with  the  hope  of  understanding  better 
the  trade-offs  between  complexity  (in  the  algorithm  as 
well  as  model)  and  accuracy  (training  and  evaluation). 
Abstract

Tree  structured  density  estimates  are  produced  via  a 
technique  which  is  similar  to  CART’s  tree  growing  al¬ 
gorithm.  Various  splitting  rules  are  investigated  and 
both  the  univariate  case  and  the  multivariate  case  are 
considered.  For  high-dimensional  densities,  determining 
the  prominent  features  of  the  density  through  an  exam¬ 
ination  of  a  binary  tree  structured  estimate  is  an  alter¬ 
native  to  attempts  at  direct  visualization  of  estimates 
constructed using kernels and other methods.

1  Introduction 

Suppose that it is desired to estimate the pdf
$f(x_1, \ldots, x_d)$ of the d-dimensional random variable
$X = (X_1, \ldots, X_d)$, where $d \ge 1$. One could begin by letting
$A_1, \ldots, A_m$ be a partition of the sample space and
estimating the average density for each set in the partition.
Letting $v_i$ be the content (or volume) of $A_i$, we have

$$v_i = \int_{A_i} dx.$$

The average density for $A_i$ is defined by

$$f_i = \frac{1}{v_i} \int_{A_i} f(x)\,dx,$$

and so

$$f_i = P(X \in A_i)/v_i.$$

Letting $N_i$ ($i = 1, \ldots, m$) be the number of observations
in a random sample of size n which belong to $A_i$, we have that

$$\hat f_i = \frac{N_i}{n v_i}$$

is an unbiased estimator for $f_i$ which converges to $f_i$ with
probability 1.

Now consider the problem of estimating $f(x^*)$, for
some $x^*$ belonging to the sample space. If f is everywhere
continuous in a neighborhood containing $x^*$, then
by considering a suitably fine partition, $f^*$ can be made
to be arbitrarily close to $f(x^*)$, where $f^* = f_i$ iff $x^* \in A_i$
(i.e., $f^*$ is the average density for the set in the partition
to which $x^*$ belongs). So if the partition is sufficiently
fine and the sample size is sufficiently large, then the
density estimator

$$\hat f(x) = \sum_{i=1}^{m} \hat f_i I_{A_i}(x),$$

which assigns a constant value to all points belonging to
a given set in the partition, should perform well.
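
As a small illustration of the estimator above, the following Python sketch computes $\hat f_i = N_i/(n v_i)$ for a one-dimensional partition given by a list of cell edges. The function names `histogram_density` and `evaluate` are hypothetical and are used only for this example; the cells are assumed to be the half-open intervals (edges[i], edges[i+1]].

```python
from bisect import bisect_left


def histogram_density(sample, edges):
    """Return f_hat_i = N_i / (n * v_i) for the cells (edges[i], edges[i+1]]."""
    n = len(sample)
    counts = [0] * (len(edges) - 1)
    for x in sample:
        # bisect_left(edges, x) - 1 is the index of the cell (lo, hi] holding x
        i = bisect_left(edges, x) - 1
        if 0 <= i < len(counts):
            counts[i] += 1
    return [c / (n * (edges[i + 1] - edges[i])) for i, c in enumerate(counts)]


def evaluate(x, edges, f_hat):
    """Value of the piecewise-constant estimate at x (0 outside the partition)."""
    i = bisect_left(edges, x) - 1
    return f_hat[i] if 0 <= i < len(f_hat) else 0.0


f_hat = histogram_density([0.5, 0.5, 2.0, 2.5], [0.0, 1.0, 3.0])
print(f_hat)
```

The cells need not have equal width, which is the property the adaptive partitions discussed below rely on.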

If we have a finite sample size, the problem of selecting
a good partition to use for the density estimator $\hat f$ given
above is an interesting one. Even for the simplest case of
d  =  1,  the  problem  of  determining  the  best  partition  on 
which  to  construct  a  histogram  estimator  has  received 
considerable  attention.  Scott  [7]  reviews  various  rules 
which  have  been  suggested  for  choosing  the  bin  width 
for  fixed  bin  width  histogram  estimators,  including  those 
of Sturges [8], Scott [6], Freedman and Diaconis [4], and
others.  Histograms  constructed  using  adaptive  meshes 
(i.e.,  the  selection  of  a  constant  bin  width  is  eschewed  in 
favor  of  a  data-based  method  of  creating  bins  of  unequal 
width)  have  been  considered  by  Wegman  [10,11],  Van 
Ryzin  [9],  and  others,  but  Scott  [7]  warns  that  in  prac¬ 
tice  caution  should  be  taken  in  using  adaptive  methods. 
Compared to the d = 1 case, much less is known about
histogram estimators for $d \ge 2$.

Besides histograms, other methods have been suggested
for density estimation, among them frequency
polygons, averaged shifted histograms, and kernel estimators.
Assuming that the chief purposes of constructing
a density estimate are to determine key features of the
density, discern the general nature of the relationships
among the variables, and develop some idea about what
the density "looks like", as opposed to merely wanting to
know the value of $f(x)$ at one or more particular points
in the sample space, a drawback associated with all of
these methods is the difficulty in visualizing or effectively
summarizing the resulting estimate if d is greater than
2, or perhaps 3. Computer graphics methods incorporating
features such as slicing, contour shells, rotation,
color, and stereo effect can certainly help, but for
high-dimensional cases it can still be very difficult to even get
a rough idea about the overall structure of the estimate.

The suggestion put forth here is to construct a
d-dimensional histogram estimate having the form given by
$\hat f(x)$ above, using an irregular adaptive partition consisting
of d-dimensional hyper-rectangles created by a recursive
partitioning scheme which is similar to the method
used by CART (see Breiman et al. [1]) for growing
classification and regression trees. At each stage in the
creation of the partition, an attempt will be made to best
divide a hyper-rectangle into two hyper-rectangles for
which the average densities differ and for which the
division leads to an improved density estimate. By
examining the splits leading to the final estimate, noting on
which variables and which values the splits are made and
how the average density estimates differ in the
hyper-rectangles created, it might be possible to determine
relationships among the variables which may be otherwise
difficult to detect. By inspecting the density estimates
for the terminal nodes, it will be easy to locate the
regions of high density in the sample space. So the goal will
be to create a useful density estimate, having the simple
form of a binary tree, which will allow one to detect
the salient features of the density without attempting to
directly visualize the estimate.

2  Tree  Growing  Methodology 

For simplicity, it will be assumed that the density
to be estimated has a compact support which is contained
within a d-dimensional hyper-rectangle that will
be taken to be the initial partition used in the tree growing
process. (If the density does not have compact support,
then one can choose an initial hyper-rectangle that
contains the convex hull of the data and use the procedures
described below to produce an estimate of the conditional
density of X, given that X belongs to the
hyper-rectangle.) The tree structured density estimate will be
produced by recursively partitioning the initial
hyper-rectangle, dividing each new hyper-rectangle which is
created until there is insufficient evidence to warrant
further divisions.

At  each  step  in  the  tree  growing  procedure,  a  hyper¬ 
rectangle  in  the  existing  partition  of  the  sample  space 
is  split  into  two  hyper-rectangles.  The  quality  of  the 
density  estimate  produced  will  depend  heavily  on  the 
method  used  to  determine  the  variable  on  which  the  split 
should  be  made  and  the  exact  location  of  the  split  (the 
value  of  the  selected  variable  that  corresponds  to  the 
division  into  two  hyper-rectangles).  The  various  rules 
considered  below  are  all  based  on  the  same  general  prin¬ 
ciple:  to  select  the  split  from  among  all  candidates  under 
consideration  which  provides  the  strongest  evidence  that 
the  density  is  nonconstant  over  the  set  in  the  partition 
to  be  split. 

With all of the rules, the location of the splits will
be based on the empirical marginal distributions formed
from the sets of observations belonging to each set of the
existing partition. At each step in the tree growing process,
there will be d conditional marginal distributions
associated with a particular partition set to consider
and the split ultimately selected will be the "strongest"
of all of the splits which can be made based on the d
empirical distributions, provided that this best split is
strong enough. Therefore, it will suffice to develop splitting
rules for univariate random samples, have associated
ways of comparing the strengths of splits made on different
samples, and determine criteria with which to assess
the strength of the strongest split. Below, I will describe
several methods of choosing a split point based on a set
of values $x_1, \ldots, x_n$ belonging to the interval (a, b] and
associated ways to characterize the strength of the split
point candidates. For $d \ge 2$, the procedure for selecting
and assessing the overall best split is given as well. Note
that although conditional distributions are used to determine
the splits, the final density estimate needs to be
based on the original full sample, using $\hat f_i = n_i/(n v_i)$.
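
The recursive scheme described in this section can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the C implementation used for the study: `grow` and `choose_split` are hypothetical names, `choose_split(values, a, b)` stands for any univariate rule that returns a strength together with a split point (or `None` when the evidence is too weak), and the minimum cell size is enforced after the split is chosen rather than by removing candidates beforehand.

```python
def grow(points, box, total_n, choose_split, min_obs=3):
    """Recursively partition `box` (a list of (a, b) pairs, one per variable).

    `points` are the observations (tuples) falling in `box`; `total_n` is the
    size of the original full sample, used for the terminal-node densities.
    """
    best = None
    for k, (a, b) in enumerate(box):
        strength, s = choose_split([p[k] for p in points], a, b)
        if s is not None and (best is None or strength > best[0]):
            best = (strength, k, s)        # strongest split over the d variables
    if best is not None:
        _, k, s = best
        left = [p for p in points if p[k] <= s]
        right = [p for p in points if p[k] > s]
        if len(left) >= min_obs and len(right) >= min_obs:
            lbox = box[:k] + [(box[k][0], s)] + box[k + 1:]
            rbox = box[:k] + [(s, box[k][1])] + box[k + 1:]
            return {"var": k, "at": s,
                    "left": grow(left, lbox, total_n, choose_split, min_obs),
                    "right": grow(right, rbox, total_n, choose_split, min_obs)}
    volume = 1.0
    for a, b in box:
        volume *= (b - a)
    # terminal node: n_i / (n * v_i), based on the original full sample size
    return {"n": len(points), "density": len(points) / (total_n * volume)}
```

Each internal node records the variable and value of its split, so the resulting nested dictionary can be read directly as the binary tree.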

The sample median method prescribes that the interval
(a, b] be split if the location of the sample median
is inconsistent with the hypothesis that the conditional
density is constant on (a, b]. That is, if the location of
the sample median differs significantly from (a + b)/2, it
will be concluded that the conditional density is not uniform
on (a, b] and the interval will be split at the sample
median into two intervals, one having an estimated density
higher than the other one. For the case of n being
odd, an assessment of whether or not the location of the
sample median provides strong evidence against a uniform
conditional density on (a, b] can be based on the
value of

$$P(|M - (a + b)/2| \ge c),$$

where M is the ((n + 1)/2)th order statistic from a uniform
(a, b] distribution and c is the value of the observed
difference $|x_{((n+1)/2)} - (a + b)/2|$. Using a normal
approximation, the above probability is about


Thus,  as  a  measure  of  the  strengths  of  the  various  can¬ 
didates  for  the  splits,  we  can  use 


and select the split which maximizes this value, where
we consider all d variables, for each case letting n be the
number of observations in the partition set and letting
a and b be the endpoints of the hyper-rectangle corresponding
to the variable under consideration. Although a
modification should be made if n is even, the adjustment
will be slight unless n is rather small and so in
practice the z-score above is used for all split candidates.
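
One way such a z-score could be computed is sketched below in Python. The standardization is an assumption made for this illustration: for odd n, the median of a uniform (a, b] sample has mean (a + b)/2 and variance $(b - a)^2/(4(n + 2))$, and that exact variance is used here; the function names are hypothetical.

```python
import math
from statistics import NormalDist


def median_split_z(xs, a, b):
    """Standardized distance of the sample median from the midpoint of (a, b]."""
    n = len(xs)
    m = sorted(xs)[(n - 1) // 2]      # the ((n+1)/2)th order statistic for odd n
    # assumed standard deviation of the median under uniformity on (a, b]
    sd = (b - a) / (2.0 * math.sqrt(n + 2))
    return abs(m - (a + b) / 2.0) / sd


def two_sided_p(z):
    """Normal approximation to P(|M - (a + b)/2| >= c) for an observed z."""
    return 2.0 * (1.0 - NormalDist().cdf(z))


z = median_split_z([0.1, 0.2, 0.3], 0.0, 1.0)
print(z, two_sided_p(z))
```

As in the text, the same score is applied here to even n without adjustment.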

If the maximum value of z is greater than some critical
value $z_{\alpha'/2}$, the split is made and the search for the
next split begins. If $\alpha'$ is taken to be

$$\alpha' = 1 - (1 - \alpha)^{1/d},$$

then a decision to split the hyper-rectangle based on the
maximum value of z corresponds to a decision to reject
with a size $\alpha$ test of the null hypothesis that the conditional
density of X is the joint density of d independent
uniform random variables against the general alternative.
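
A per-variable level that yields an overall size $\alpha$ test when the d marginal tests are independent is $\alpha' = 1 - (1 - \alpha)^{1/d}$. The short Python sketch below assumes that choice and computes the corresponding two-sided normal critical value; `per_variable_level` and `critical_z` are hypothetical names.

```python
from statistics import NormalDist


def per_variable_level(alpha, d):
    """Level alpha' for each of d independent tests, for overall size alpha."""
    return 1.0 - (1.0 - alpha) ** (1.0 / d)


def critical_z(alpha, d):
    """Upper alpha'/2 point of the standard normal distribution."""
    return NormalDist().inv_cdf(1.0 - per_variable_level(alpha, d) / 2.0)


print(per_variable_level(0.05, 3), critical_z(0.05, 3))
```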

Letting

$$g_i = \frac{x_{(i)} + x_{(i+1)}}{2} \qquad (i = 1, \ldots, n - 1),$$

$g_i$ will be called the ith gap point. If the location of the
sample median in (a, b] results in the decision to create
a split, the split will be made at $g_j$, where $j = n/2$ if n
is even and j is either $(n - 1)/2$ or $(n + 1)/2$ if n is odd.

If  the  partition  set  under  consideration  contains  a 
mode,  then  it  may  be  that  none  of  the  d  sample  medians 
will  be  very  far  from  the  center  of  the  hyper-rectangle 
even  though  the  joint  density  is  not  constant.  In  order 
to  prevent  the  recursive  partitioning  from  terminating 
prematurely  with  such  a  partition  set,  a  trial  split  can 
be  made.  If  either  of  the  two  hyper-rectangles  which 
result  from  the  trial  split  produce  a  sufficiently  strong 
split,  then  the  trial  split  is  accepted  and  the  search  for 
further  splits  continues.  Otherwise,  the  trial  split  is  not 
retained  and  the  tree  growth  terminates  in  that  region 
of  the  sample  space. 

The all possible split points method considers many
possible split points for each interval. The set of n − 1 gap
points will be taken to be the split point candidates. The
strength of the split for the candidate $s \in (a, b]$ is based
on the proportion of observations which lie in (a, s]. If
the proportion differs significantly from $(s - a)/(b - a)$, which is
the expected value of the proportion if the conditional
density is uniform over (a, b], then it is concluded that
the average density for (a, s] is different from the average
density for (s, b]. The strength of the evidence in support
of the split is measured by

$$z_s = \frac{|t_s - n p_s|}{\sqrt{n p_s (1 - p_s)}},$$

where $t_s$ is the number of observations in (a, s] and
$p_s = (s - a)/(b - a)$.
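
The search over gap points can be sketched in Python as follows, assuming the z-score has the binomial form $|t_s - n p_s| / \sqrt{n p_s (1 - p_s)}$ with $p_s = (s - a)/(b - a)$ and that all observations lie in (a, b]. The name `best_gap_split` is hypothetical.

```python
import math
from bisect import bisect_right


def best_gap_split(xs, a, b):
    """Return (z, g): the largest z-score and the gap point attaining it."""
    srt = sorted(xs)
    n = len(srt)
    best_z, best_g = 0.0, None
    for i in range(n - 1):
        g = (srt[i] + srt[i + 1]) / 2.0       # gap point between neighbors
        p = (g - a) / (b - a)                 # expected proportion if uniform
        if not 0.0 < p < 1.0:
            continue
        t = bisect_right(srt, g)              # number of observations <= g
        z = abs(t - n * p) / math.sqrt(n * p * (1.0 - p))
        if z > best_z:
            best_z, best_g = z, g
    return best_z, best_g


print(best_gap_split([0.1, 0.2, 0.3, 0.9], 0.0, 1.0))
```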

When considering the strength of the strongest split
overall using the z-score given above, in addition to the
simultaneous inference phenomenon due to the fact that
more than one marginal distribution is being examined,
it is now the case that many possible split points are
being considered for each marginal distribution. One
might think that this additional source of multiple comparisons
can be accounted for by determining if a one-sample
Kolmogorov-Smirnov goodness-of-fit test for the
uniform distribution produces a significant result, but
in fact there is a discrepancy since the Kolmogorov-Smirnov
test depends on the maximum value, for all
$s \in (a, b]$, of $|t_s - n p_s|$, which is not equivalent to assessing
the hypothesis of a constant density using the
maximum value of the z-score above. This suggests yet
another tree growing procedure, called the Kolmogorov-Smirnov
method, for which the one-sample Kolmogorov-Smirnov
statistic is computed based on each of the d empirical
marginal distributions and if the largest of these
values is sufficiently large (say, corresponding to a rejection
of the null hypothesis of a uniform distribution with
a size $\alpha'$ test), then the hyper-rectangle is split.

The split will be made on the variable which produces
the largest value of the K-S test statistic. The split will
be made at the gap point $g_j$ which maximizes

$$|F_X(g_j) - F_{unif}(g_j)|,$$

where $F_X$ is the empirical cdf and

$$F_{unif}(g_j) = (g_j - a)/(b - a).$$

It is interesting that maximizing $|F_X(g_j) - F_{unif}(g_j)|$ is
equivalent to maximizing

$$\int_a^b \left| \hat f_{g_j}(x) - \frac{1}{b - a} \right| dx,$$

where $\hat f_{g_j}$ is the piecewise constant density estimate

$$\hat f_{g_j}(x) = \begin{cases} \dfrac{t_{g_j}}{n(g_j - a)}, & a < x \le g_j, \\[1ex] \dfrac{n - t_{g_j}}{n(b - g_j)}, & g_j < x \le b, \end{cases}$$

and $t_{g_j}$ is the number of observations in $(a, g_j]$. Thus
splitting at the value for which the empirical cdf differs
the most from the cdf of a uniform (a, b] random variable
corresponds to selecting the two-bin histogram density
estimate which differs the most, in an $L_1$ sense, from the
density of a uniform (a, b] random variable.
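
The equivalence just described can be checked numerically. The Python sketch below, with hypothetical function names, selects the gap point at which the empirical cdf differs most from the uniform cdf and computes the $L_1$ distance between the resulting two-bin histogram and the uniform density; the latter works out to twice the cdf difference, so both criteria pick the same gap point. It covers only the choice of split location; the K-S test statistic itself is a supremum over all x and is not computed here.

```python
from bisect import bisect_right


def ks_split(xs, a, b):
    """Gap point maximizing |F_X(g) - (g - a)/(b - a)|, with that difference."""
    srt = sorted(xs)
    n = len(srt)
    best_g, best_d = None, -1.0
    for i in range(n - 1):
        g = (srt[i] + srt[i + 1]) / 2.0
        d = abs(bisect_right(srt, g) / n - (g - a) / (b - a))
        if d > best_d:
            best_g, best_d = g, d
    return best_g, best_d


def l1_distance_two_bin(xs, a, b, g):
    """L1 distance between the two-bin histogram split at g and uniform (a, b]."""
    n = len(xs)
    t = sum(1 for x in xs if x <= g)
    left = t / (n * (g - a))
    right = (n - t) / (n * (b - g))
    u = 1.0 / (b - a)
    return abs(left - u) * (g - a) + abs(right - u) * (b - g)


sample = [0.1, 0.2, 0.3, 0.9]
g, d = ks_split(sample, 0.0, 1.0)
print(g, d, l1_distance_two_bin(sample, 0.0, 1.0, g))
```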

The squared difference method is similar to the K-S
method in that the decision of whether or not a split
should be made is based on the value of a test statistic
for a goodness-of-fit test. The null hypothesis that
the density is constant over (a, b] is tested against the
general alternative using the Cramér-von Mises type of
statistic given by

$$Q = n \int_a^b \left[ F_X(x) - \frac{x - a}{b - a} \right]^2 \frac{dx}{b - a}.$$

A split is made on the variable which produces the
largest of the observed values of Q, provided that the
value exceeds the upper level $\alpha'$ critical value of the
distribution free statistic. The split will be made at the
gap point $g_j$ which is associated with the two-bin histogram
density estimate which differs the most, in an
$L_2$ sense, from the density of a uniform (a, b] random
variable. That is, $g_j$ is selected to maximize

$$\frac{[t_{g_j}(b - a) - n(g_j - a)]^2}{n^2 (b - a)(b - g_j)(g_j - a)}.$$
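
Both quantities used by the squared difference method can be computed in closed form. The Python sketch below assumes the statistic has the usual Cramér-von Mises form $Q = n \int_a^b [F_X(x) - (x - a)/(b - a)]^2 \, dx/(b - a)$ and that the split criterion is the integrated squared difference between the two-bin histogram and the uniform density; `cvm_q` and `l2_split_score` are hypothetical names.

```python
def cvm_q(xs, a, b):
    """Q = n * integral over (a, b] of [F_X(x) - (x-a)/(b-a)]^2 dx/(b-a)."""
    n = len(xs)
    # rescale to (0, 1] and add the endpoints of the interval
    u = [0.0] + sorted((x - a) / (b - a) for x in xs) + [1.0]
    total = 0.0
    for i in range(n + 1):
        c = i / n                     # the empirical cdf is i/n on [u[i], u[i+1])
        total += ((u[i + 1] - c) ** 3 - (u[i] - c) ** 3) / 3.0
    return n * total


def l2_split_score(xs, a, b, g):
    """Integrated squared difference between the two-bin histogram and uniform."""
    n = len(xs)
    t = sum(1 for x in xs if x <= g)
    return (t * (b - a) - n * (g - a)) ** 2 / (n ** 2 * (b - a) * (b - g) * (g - a))


sample = [0.1, 0.2, 0.3, 0.9]
print(cvm_q(sample, 0.0, 1.0), l2_split_score(sample, 0.0, 1.0, 0.25))
```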

The likelihood ratio method is based on a generalized
likelihood ratio test of the null hypothesis that the conditional
density is constant on (a, b] against the alternative
that the density has the form

$$f(x) = \begin{cases} h, & a < x \le s, \\ k, & s < x \le b, \end{cases}$$

where

$$k = \frac{1 - (s - a)h}{b - s},$$

$s \in (a, b]$, and $h \ne (b - a)^{-1}$. A rejection of the null
hypothesis supports the conclusion that the density over
(a, b] is better estimated by splitting the interval into two
pieces and estimating the density for each piece separately
with a constant than it is by not splitting the interval
and using only one value for the density over (a, b].
For a given value of s, the likelihood function for the
sample $x_1, \ldots, x_n$ is maximized by letting

$$\hat h = \frac{t_s}{n(s - a)},$$

where $t_s$ is the number of observations belonging to (a, s].
It follows that to maximize the likelihood over both parameters,
it is necessary to find the value of s that maximizes

$$\left[ \frac{t_s/n}{s - a} \right]^{t_s} \left[ \frac{1 - (t_s/n)}{b - s} \right]^{n - t_s}.$$

The likelihood ratio is

$$\Lambda = \left[ \frac{n(b - s)}{(n - t_s)(b - a)} \right]^{n} \left[ \frac{(n - t_s)(s - a)}{t_s(b - s)} \right]^{t_s},$$

and the null hypothesis is rejected whenever $\Lambda$ is sufficiently
small. The regularity conditions required to ensure
that the null distribution of $-2 \log \Lambda$ is asymptotically
$\chi^2_2$ are not satisfied for the testing situation under
consideration. Nevertheless, I found that a splitting criterion
based on the $\chi^2_2$ distribution works satisfactorily.
For each variable, the function given for $\Lambda$ above was
minimized over all choices of $s \in \{g_1, \ldots, g_{n-1}\}$. Letting
$\Lambda'$ be the smallest such value obtained with all of the
variables, a split is made at the minimizing gap point if
$-2 \log \Lambda' > \chi^2_{2,\alpha'} = -2 \log \alpha'$.
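
A minimal Python sketch of this criterion for a single variable is given below. It evaluates $-2 \log \Lambda$ at every gap point using the likelihood ratio written above, keeps the largest value (equivalently, the smallest $\Lambda$), and compares it with $-2 \log \alpha'$, which is the upper $\alpha'$ point of a chi-square distribution with 2 degrees of freedom. The function names are hypothetical.

```python
import math
from bisect import bisect_right


def neg2_log_lambda(n, t, a, b, s):
    """-2 log of the likelihood ratio for a split of (a, b] at s, with t obs in (a, s]."""
    log_lam = (n * math.log(n * (b - s) / ((n - t) * (b - a)))
               + t * math.log((n - t) * (s - a) / (t * (b - s))))
    return -2.0 * log_lam


def best_lr_split(xs, a, b):
    """Return (statistic, gap point) with the largest -2 log Lambda."""
    srt = sorted(xs)
    n = len(srt)
    best = (0.0, None)
    for i in range(n - 1):
        g = (srt[i] + srt[i + 1]) / 2.0
        t = bisect_right(srt, g)
        if 0 < t < n and a < g < b:
            stat = neg2_log_lambda(n, t, a, b, g)
            if stat > best[0]:
                best = (stat, g)
    return best


def lr_threshold(alpha_prime):
    """Upper alpha' critical value of chi-square with 2 degrees of freedom."""
    return -2.0 * math.log(alpha_prime)


stat, g = best_lr_split([0.1, 0.2, 0.3, 0.9], 0.0, 1.0)
print(stat, g, stat > lr_threshold(0.05))
```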

For all of the methods described above, gap points
were removed from consideration as split points if a split
at the gap point would result in a hyper-rectangle being
created which did not contain at least a minimum number
of observations. Values considered for this minimum
ranged from 3 to 50, but perhaps using a value less than
3 will improve the accuracy of the estimate for the tails
of the distribution.

The  regression  tree  method  of  creating  a  tree  struc¬ 
tured  density  estimate  makes  direct  use  of  CART’s  pro¬ 
cedure  for  constructing  a  regression  tree  and  is  rather 
different  from  the  methods  previously  discussed.  First,  a 
regular  partition  is  created  by  dividing  the  initial  hyper- 
rectangle  into  a  large  number  of  identically  shaped  small 
hyper-rectangles.  Next,  the  number  of  observations  in 
each  small  hyper-rectangle  is  determined  and  the  cor¬ 
responding  density  estimate  for  the  partition  set  is  as¬ 
signed  as  the  y  value  corresponding  to  the  x  located  at 
the  center  of  the  hyper-rectangle.  CART’s  regression 
procedure  is  then  used  to  construct  a  regression  tree 
based  on  these  (y,  x)  values.  The  partition  for  this  re¬ 
gression  tree  serves  as  the  partition  for  the  density  es¬ 
timate.  Thus,  the  density  estimate  based  on  the  initial 
fine  partition  is  smoothed  by  CART’s  regression  algo¬ 
rithm  to  produce  an  estimate  based  on  a  coarser  (and 
most  likely  irregular)  partition. 
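
The first two steps of the regression tree method, building the regular fine partition and the (y, x) pairs, can be sketched in Python as follows. The function name `grid_density_points` and the equal number of bins per variable are assumptions made for this illustration, and the observations are assumed to lie inside the initial hyper-rectangle; the resulting pairs would then be passed to a regression tree procedure such as CART, which is not reproduced here.

```python
from itertools import product


def grid_density_points(points, box, bins):
    """Return (y, x) pairs: the density estimate and center of each grid cell."""
    n = len(points)
    d = len(box)
    widths = [(b - a) / bins for a, b in box]
    cell_volume = 1.0
    for w in widths:
        cell_volume *= w
    counts = {}
    for p in points:
        # index of the cell containing p; the top edge is folded into the last cell
        idx = tuple(min(int((p[k] - box[k][0]) / widths[k]), bins - 1)
                    for k in range(d))
        counts[idx] = counts.get(idx, 0) + 1
    rows = []
    for idx in product(range(bins), repeat=d):
        center = tuple(box[k][0] + (idx[k] + 0.5) * widths[k] for k in range(d))
        rows.append((counts.get(idx, 0) / (n * cell_volume), center))
    return rows


data = [(0.1, 0.1), (0.9, 0.9), (0.8, 0.7), (0.6, 0.9)]
for y, x in grid_density_points(data, [(0.0, 1.0), (0.0, 1.0)], 2):
    print(x, y)
```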

A nice feature of the regression tree method is that
CART's cross-validation procedure can be easily invoked
to select the right sized tree. In general, cross-validation
could be used in conjunction with the other methods as
well. The criterion parameter $\alpha$ which partially governs
tree size can be chosen to be large and the minimum
number of observations allowed in a partition set can
be made small, so that an overly complex tree is first
constructed. Then a cross-validation based pruning procedure
can be performed to select the tree which corresponds
to the most honest estimate, using the estimated
likelihood function

to  judge  accuracy  (however,  this  may  not  be  satisfactory 
unless  it  were  the  case  that  the  estimated  likelihood  func¬ 
tion  should  be  nonzero  over  the  convex  hull  of  the  data). 

Using CART's regression algorithm also allows for an
easy implementation of linear combination splits, although
this may make it more difficult to identify the
key features of the density with a quick inspection of the
tree. Some of the other methods described above are
not  easily  modified  to  handle  linear  combination  splits, 
but  modifications  of  the  Kolmogorov-Smirnov  method 
and  the  squared  difference  method  can  be  considered. 
Splits  can  be  made  which  produce  the  greatest  overall 
difference  between  the  single  constant  estimate  based  on 
the  partition  set  under  consideration  and  the  two  con¬ 
stant  estimate  which  would  result  from  a  split,  where  the 
integrated  absolute  difference  or  the  integrated  squared 
difference  can  be  used  as  a  measure  of  overall  difference. 

3  Results 

C programs were written to compute density estimates
based on the methods described above. A performance
study was done using samples of non-uniform
pseudo random variates, created using standard techniques
(such as those described in Dagpunar [2], Devroye
[3], and Knuth [5]). Numerous samples were used, based
on combinations of normal, beta, and gamma distributions
and having $1 \le d \le 3$ and $1000 \le n \le 4000$. The
well-known Buffalo snowfall data set (having n = 63)
was also considered. Values used for $\alpha$ ranged from 0.005
to 0.1. Usually, the choice of $\alpha$ had little effect on the
density estimate and letting $\alpha$ equal 0.05 seems to be a reasonable
choice (but additional study is warranted here). Overall,
it appears that the best trees are created when the
minimum number of observations allowed in a partition
set is more responsible for the size of the tree than is the
value of $\alpha$.

In general, most of the methods tended to produce
very similar results in many of the cases considered. But
not all of the methods have been thoroughly investigated
and so any conclusions are tentative at this time. For
the univariate cases, the tree structured estimates were
typically quite a bit coarser than histograms constructed
from the same samples using some of the common fixed
bin width rules. While they might have an overall lack of
accuracy based on a criterion such as the MISE (see Scott
[7], p. 38), the tree structured estimates very rarely indicated
more modes than what was proper, provided that
partition sets containing only a small number of observations
were disallowed. Furthermore, for all values of d,
if partition sets containing only a small number of observations
are disallowed, then the average densities for the
partition sets (the $f_i$) were often very closely estimated
by the $\hat f_i$. In general, the $\hat f_i$ were highly correlated with
the $f_i$ and in almost every case considered the partition
set having the largest estimated average density was the
partition set having the largest average density. So, all
in all, the tree structured estimators seem to be very
good with regard to finding modes. Also, although the
coarseness contributes to an overall lack of accuracy, for
large d, a finer partition may correspond to a tree structured
estimate from which it would be more difficult to
identify the main features of the density.
Abstract

Due to the "curse of dimensionality" and the difficulty of
presenting the estimate, high-dimensional density estimation is
usually an ill-defined problem. However, many practical
problems involve estimating high-dimensional densities, so
some kind of dimensionality reduction is necessary.
A computationally intensive density estimation
method is developed based on the tree-structured
methodology. The method is able to identify
noises in the density structure and therefore greatly reduces
the dimensionality. It also provides a simple way to
present a high-dimensional density, and thus helps us to
explore the data structure. Simulation studies show that
the method is often more accurate than other regular
methods when the density structure is complex.

1  Introduction 

The problem of density estimation is to construct a function
from a random sample to approximate the real density.
The purposes of density estimation are to present
and explore the density structure as well as to obtain
accurate estimates for other applications. A survey of
methods can be found in books such as Silverman [1986] and
Scott [1992]. When the data dimension is high, these regular
methods, as well as density estimation itself, suffer from the
so-called "curse of dimensionality". It is also impossible
to visualize a density surface when the dimension is
higher than five.

It has been realized that in practical situations, the
dimension of a true data structure is often much lower
than the number of variables in the study (Scott [1992]).
Therefore it is desirable to project the high-dimensional
data onto a manageable lower dimensional subspace. In
practice, the projection pursuit method (Friedman, Stuetzle,
and Schroeder [1984]) and some regular techniques such as
principal components decomposition are used to reduce
the dimensionality.

Complex environmental modeling studies (Spear,
Hornberger [1980]) result in many multivariate data
analysis problems which are essentially density estimation
problems. Many of the variables involved behave like
noises. In fact, it seems that different sets of variables
feature local data structures in different subspaces, while
adding or removing other variables in these subspaces
has little effect. If we can locate these subspaces and
can identify the noises, the dimensionality will be greatly
reduced.

In Section 2, we will discuss the noises in density
structure. In Section 3, the CART (Classification
And Regression Tree) methodology (Breiman et al.
[1985]) will be applied to construct a density estimation
method according to the insights obtained from Section 2.
Although the idea has been circulated among
the authors of CART and other researchers, there are
many serious problems in applying CART to density estimation.
We solved these problems by defining a
roughness parameter and following a tree optimizing approach
developed by Shang [1993] and Breiman and Shang
[1994]. A similar type of application has been explored in
contingency table analysis (see Shang [1993], [1994]).

The performance of the method is studied through
simulations. Spear, Grieb and Shang [1994] applied it to
study the uncertainty of complex environmental modeling.
Some techniques to enhance its interpretation were
discussed in that paper.

2  Noises  in  Density  Structure 

In  this  section,  we  will  provide  an  application  back¬ 
ground  for  multivariate  density  estimation.  This  will 
give  us  some  insights  about  noises  in  density  structure. 

2.1  Pass  Region  and  Density 

An environmental model is usually very complex and
is applicable to many similar environmental processes.
When it is applied to a specific situation, it is necessary
to study the sensitivity of the model to the local phenomena.
Sensitivity analysis can help us to understand
more about the model structure, to monitor an
environmental process more efficiently, and to build a more
reliable procedure for risk assessment. A model can be
simplified as:

$$Y = f(X, \theta) \qquad (1)$$

where X is background information of a local experiment;
Y is the output; $\theta$ is the vector of parameters, which
varies in a parameter space $\Theta$. At a local point X, we
may be interested in when the model produces outputs similar
to our observations, or we may be interested in when
the outputs exceed some extreme limits. We use $C_Y$ to
denote the region of these outputs and call it a criterion
region. The parameter set which produces outputs
in $C_Y$ is called a pass region:

$$V = \{\theta : f(X, \theta) \in C_Y\} \qquad (2)$$

The pass region summarizes information on parameter
sensitivity with respect to the local background and the
criterion.

It is usually impossible to solve equation (2) analytically.
An alternative is to use Monte Carlo simulations.
If we take a uniform sample from the parameter space
$\Theta$, then the points that produce outputs falling in $C_Y$
form a uniform sample from the pass region V. Our
problem is how to reconstruct the pass region from the
random sample. This is equivalent to estimating the indicator
function of V or its smoothed version:

if the boundary of V is continuous. Here $v_\theta$ is the volume
of $N_\theta$, a small neighborhood of $\theta$; $u_\theta$ is the volume of
$N_\theta \cap V$; and $v_0$ is the volume of V.
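
The Monte Carlo step can be sketched in Python as follows. The model and criterion used at the bottom are purely hypothetical stand-ins chosen for this example (the output depends only on the first parameter); `pass_sample` is likewise a hypothetical name. Points are drawn uniformly from a rectangular parameter space and those whose output falls in the criterion region are kept.

```python
import random


def pass_sample(model, in_criterion, bounds, n_draws, seed=0):
    """Draw uniformly from the box `bounds` and keep the pass points."""
    rng = random.Random(seed)
    kept = []
    for _ in range(n_draws):
        theta = tuple(rng.uniform(lo, hi) for lo, hi in bounds)
        if in_criterion(model(theta)):
            kept.append(theta)          # theta lies in the pass region
    return kept


# toy setting: the output equals the first parameter, criterion region y <= 0.5
points = pass_sample(lambda th: th[0], lambda y: y <= 0.5,
                     [(0.0, 1.0), (0.0, 1.0)], 2000, seed=1)
print(len(points) / 2000)
```

In this toy setting the second parameter has no influence on the output, so the kept points remain uniform in that coordinate.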

Notice that $g(\theta)$ is a density function. The process of
getting a pass point can be defined as a $\Theta \to \Theta$ random
variable $\xi$ with $g(\theta)$ as its density. If there is a prior
distribution $\pi(\theta)$ on $\Theta$, then the corresponding distribution
of parameters in V is just the posterior distribution
$f(\theta \mid \xi)$. Now it is this posterior distribution, rather than
the region itself, that features the parameter sensitivity. If
we take points from $\Theta$ according to $\pi(\theta)$, then the pass
points form a sample from $f(\theta \mid \xi)$.

In  any  case,  the  problem  of  parameter  sensitivity  anal¬ 
ysis  is  essentially  a  density  estimation  problem.  We  are 
trying  to  recover  some  main  features  of  a  distribution 
from  samples. 

We may replace the term criterion region by critical
region, and pass region by confidence region, in the above
context. Then we are dealing with a typical statistics
problem: exploring features of a complex confidence region.
Obviously, the same idea can be applied to many other
problems, either in statistics or in other fields.


2.2  Noises  and  Dimensionality 


The number of parameters (variables) in model (1) is
usually very large. However, it is expected that only a
few of them will be useful in a local experiment. The
traditional sensitivity analysis reduces the dimensionality
by examining each individual variable. The
simplest way is to compare the sample's range with the
variable's range. The larger the difference, the more
sensitive the variable.
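
The range comparison can be written in a few lines of Python. In this sketch, with the hypothetical name `range_ratios`, each variable's sample range is divided by the full range of the variable; a ratio well below 1 corresponds to a large difference between the two ranges and so to a sensitive variable.

```python
def range_ratios(sample, bounds):
    """Ratio of the sample range to the variable range, one value per variable."""
    return [(max(p[k] for p in sample) - min(p[k] for p in sample)) / (hi - lo)
            for k, (lo, hi) in enumerate(bounds)]


print(range_ratios([(0.1, 0.0), (0.3, 1.0)], [(0.0, 1.0), (0.0, 1.0)]))
```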

This simplest way ignores the distribution of the variables
and their interactions. However, if there are no
interactions among the variables and all variables are uniformly
distributed, or equivalently if the variables as a whole
follow a uniform distribution on a hyper-rectangle, the
rectangle will completely feature the local experiment.
Individually, if a variable is independent of the others
and follows a uniform distribution over the variable's
range, then the variable will be useless in featuring the
local experiment. It behaves like a noise. We call it a
global noise.

It may not be easy to detect a global noise before
estimating the density. Nor may there be enough global
noises to reduce the dimensionality to a manageable level.
However, a variable may be important in some subspaces
and have no influence at all in others. We
call it a local noise. Being able to locate these
subspaces and to identify the local noises in each of
them will greatly reduce the dimensionality of our
problem.

Notice that if the underlying density is a smooth function
and a subspace is sufficiently small, the real density
in the subspace can be approximated by a constant.
In this subspace all variables can be considered local
noises. Therefore, if we can find a way to partition the
whole space into subspaces such that the density
is (approximately) constant in each of them,
then all features of the density will be summarized
(approximately) by the partition process.

This leads to our approach to estimating a high-dimensional
density: by identifying feature variables
and noise variables, we partition the data space with the
feature variables until all variables are local noises. The
density can then be estimated immediately by the constants
in the subspaces, and the density structure is summarized
by the partition. The approach needs to be insensitive
to noises. It is also necessary to organize the partition
into a simple, understandable format.

3  Tree-Structured  Density  Estimation

The CART tree methodology has the features we asked for
at the end of the last section. Some of its important ideas
are outlined in Subsection 3.1. A few serious problems
arise if we apply the method to density estimation
directly. We discuss these problems and their solutions
in Subsections 3.2 and 3.3. Simulation results are
presented in Section 4.

3.1  Tree-Structured  Methodology 

The procedure of applying the tree-structured methodology
to a statistics problem includes the following steps:
defining the problem; defining splits; tree growing; pruning;
and tree selection.

First  we  need  to  have  a  measure  of  lack  of  accuracy.  It 
will  decrease  as  the  partition  gets  finer.  We  also  need  to 
define  a  pool  of  splits.  The  splitting  rule  decides  which 
split  should  be  selected  from  the  pool.  The  tree  will  keep 
on  growing  until  some  stopping  rules  are  reached.  This 
results  in  a  big  tree. 

The essential idea in CART lies in the pruning and tree
selection process. It adapts an idea from variable
selection in regression analysis: select the best
dimensionality rather than the best combination of
variables, as the former is more stable than the latter from
sample to sample. In CART, the dimensionality is the
tree size (number of terminal nodes). The pruning
algorithm produces a “best” subtree for each given tree
size. This “best” subtree has the smallest estimation
error among all possible subtrees of the same size.
After the “best” tree list is established, an independent
test data set or cross-validation is applied to select
the “best” tree size. The method then produces the final
“best” tree.

A tree optimizing approach was developed by Shang
[1993] and Breiman and Shang [1994] to make the “best”
tree list even better. It tries to adjust the existing splits
to offset the effects of the “greedy” nature of the
CART stepwise procedure.


3.2  Roughness  Parameter

For density estimation, a measure of accuracy is the
mean integrated square error:

$$\mathrm{MISE} = E \int (\hat f - f)^2 \qquad (4)$$

Without loss of generality, we assume the data are
defined on the p-dimensional unit cube: $S = (0,1)^p$. The $n$
random points $X_1, X_2, \ldots, X_n$ have been sampled from
some underlying density $f$ defined on $S$.

Suppose $S$ has been partitioned into $m$ subspaces:

$$S = \bigcup_{i=1}^{m} S_i \qquad (5)$$

by some partition $\tau$. Each $S_i$ has volume $V_i$. $\tau$ also
divides the $n$ data points into $m$ parts. Let $n_i$ be the
number of data points in $S_i$. Then the density can be
estimated by:

$$\hat f(x) = d_i = \frac{n_i}{n V_i} \quad \text{if } x \in S_i \qquad (6)$$

For this estimator $\hat f$, the last two terms of the
integrated square error $\int (\hat f - f)^2 = \int f^2 - 2\int f \hat f + \int \hat f^2$
can be calculated easily, with $\int_{S_i} f$ estimated by $n_i/n$:

$$I(\tau) = \sum_{i=1}^{m} d_i^2 V_i - 2\sum_{i=1}^{m} d_i \frac{n_i}{n} = -\frac{1}{n^2}\sum_{i=1}^{m}\frac{n_i^2}{V_i} \qquad (7)$$

If one of the $V_i$ is very small, yet $S_i$ still contains at
least one data point, then $I(\tau)$ will be very small. If we
use the number of subspaces to build the “best” tree list,
then rough estimates will be selected no matter what
the real density is. A better smoothness measure should
consider both the fineness (number of subspaces) and the
evenness of the partition.

We define a roughness parameter as the reciprocal of the
harmonic average of the volumes of the subspaces:

$$R(\tau) = \frac{1}{m}\sum_{i=1}^{m}\frac{1}{V_i} \qquad (8)$$

It is easy to show that $R(\tau) \ge m$, and $R(\tau) = m$ if and
only if $\tau$ is the even partition, i.e. $V_i = 1/m$. In general,
the larger $m$ is and/or the more uneven $\tau$ is, the
larger $R(\tau)$. These are the properties we desire.
We also call $P(\tau) = m \cdot R(\tau)$ the penalty function of
$\tau$.

Since $R(\tau)$ takes continuous values, it is necessary to
make it discrete. For a positive number $q$, we define:

$$C(k) = \{\tau : k - 0.5 < qR(\tau) < k + 0.5\}, \quad k = 1, 2, \ldots \qquad (9)$$

All $\tau$ in $C(k)$ are considered to have the same roughness
parameter $k$. Here $q$ is a scaling factor; normally we take
it to be 1.

Another possible roughness measure is:

$$R'(\tau) = \ldots \qquad (10)$$

(the right-hand side of (10), which involves a sum starting at
$i = 1$, is illegible in the source).

It has better probability interpretations but penalizes
unevenness less than $R(\tau)$ does.
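
A small one-dimensional sketch may make these quantities concrete. It assumes the forms $d_i = n_i/(nV_i)$ for (6), $I(\tau) = -\frac{1}{n^2}\sum_i n_i^2/V_i$ for (7) and $R(\tau) = \frac{1}{m}\sum_i 1/V_i$ for (8); the last two are readings reconstructed so as to agree with equations (12) and (13), not verbatim copies of the source. The data points and cell edges are made up for illustration.

```python
def cell_counts(points, edges):
    """Count the points falling in each cell [edges[i], edges[i + 1])."""
    counts = [0] * (len(edges) - 1)
    for x in points:
        for i in range(len(counts)):
            if edges[i] <= x < edges[i + 1]:
                counts[i] += 1
                break
    return counts


def density_heights(counts, volumes):
    """Piecewise-constant estimate d_i = n_i / (n * V_i), as in (6)."""
    n = sum(counts)
    return [c / (n * v) for c, v in zip(counts, volumes)]


def error_term(counts, volumes):
    """Assumed form of (7): I(tau) = -(1 / n^2) * sum(n_i^2 / V_i)."""
    n = sum(counts)
    return -sum(c * c / v for c, v in zip(counts, volumes)) / (n * n)


def roughness(volumes):
    """Assumed form of (8): R(tau) = (1 / m) * sum(1 / V_i)."""
    return sum(1.0 / v for v in volumes) / len(volumes)


def penalty(volumes):
    """P(tau) = m * R(tau) = sum(1 / V_i)."""
    return sum(1.0 / v for v in volumes)


# Illustrative data on (0, 1) with an even partition into four cells.
points = [0.05, 0.1, 0.15, 0.2, 0.6, 0.9]
edges = [0.0, 0.25, 0.5, 0.75, 1.0]
volumes = [b - a for a, b in zip(edges, edges[1:])]
counts = cell_counts(points, edges)
heights = density_heights(counts, volumes)
```

For the even partition the roughness equals the number of cells, while an uneven partition such as volumes `[0.25, 0.75]` gives a roughness above its cell count of 2.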

3.3  Tree  Growing  and  Pruning

Due to the restriction on paper length, we discuss here
only three issues that are most essential to density
estimation.


Pool  of  Splits 

Since the volumes of the subspaces influence $I(\tau)$ too,
there are essentially infinitely many ways to make a split. Our
splits are selected from the following pool:

$$\Sigma = \{\sigma : \sigma = (x_i < j/G),\ 1 \le i \le p,\ 1 \le j < G\} \qquad (11)$$

Here $G$ is a positive integer. We usually take it to be 100.


Splitting  Rules 


Suppose the current subspace is $S_0$ and a split $\sigma \in \Sigma$
divides $S_0$ into $S_l$ and $S_r$. Let $n_0, n_l, n_r$ and $V_0, V_l, V_r$
be the corresponding numbers of data points and volumes. Then the
integrated square error defined in (7) will be decreased
by:

$$\Delta I(\sigma) = \frac{n_0^2 V_0}{n^2 V_l V_r}\left(\frac{V_l}{V_0} - \frac{n_l}{n_0}\right)^2 \qquad (12)$$

Maximizing $\Delta I(\sigma)$ favors splits close to the boundaries
of $S_0$, and thus makes the estimate rough. At the same
time, the penalty function $P$ increases by:

$$\Delta P(\sigma) = \frac{V_0^2 - V_l V_r}{V_0 V_l V_r} \qquad (13)$$

A better splitting rule would be to maximize $\Delta I(\sigma)/\Delta P(\sigma)$. We
actually use a simpler version, $D(\sigma)$ (equation (14), which
is missing from the source).

Maximizing $D(\sigma)$ amounts to maximizing the distance between
the empirical c.d.f. of the samples in subspace $S_0$ and the c.d.f.
of the local uniform density. As we are going to apply
the tree optimizing procedure, this initial selection is
not critical.
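
The split quantities can be checked numerically. The sketch below assumes $I(\tau) = -\frac{1}{n^2}\sum_i n_i^2/V_i$ and $P(\tau) = \sum_i 1/V_i$, so that the gain of a split is the plain difference of these sums before and after it; the closed forms are the readings of (12) and (13) used here. The function `cdf_distance` is only a guess at the missing $D(\sigma)$, based on the verbal description (distance between the empirical c.d.f. and the local uniform c.d.f. at the split point), and is not taken from the source.

```python
def delta_I(n, n0, nl, V0, Vl):
    """Decrease of I(tau) computed directly from its per-cell terms."""
    nr, Vr = n0 - nl, V0 - Vl
    return (nl ** 2 / Vl + nr ** 2 / Vr - n0 ** 2 / V0) / n ** 2


def delta_I_closed(n, n0, nl, V0, Vl):
    """Closed form read from (12)."""
    Vr = V0 - Vl
    return n0 ** 2 * V0 / (n ** 2 * Vl * Vr) * (Vl / V0 - nl / n0) ** 2


def delta_P(V0, Vl):
    """Increase of the penalty, closed form read from (13)."""
    Vr = V0 - Vl
    return (V0 ** 2 - Vl * Vr) / (V0 * Vl * Vr)


def cdf_distance(n0, nl, V0, Vl):
    """Assumed stand-in for D(sigma): |empirical c.d.f. - uniform c.d.f.|."""
    return abs(nl / n0 - Vl / V0)


# Illustrative numbers: 40 of n = 100 points lie in S_0 (volume 0.5);
# the split sends 30 of them to a left cell of volume 0.2.
gain = delta_I(100, 40, 30, 0.5, 0.2)
cost = delta_P(0.5, 0.2)
```

Under these assumptions the direct and closed-form gains agree, and `gain / cost` is the ratio criterion mentioned in the text.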


Tree  Pruning 


The roughness parameter $R(\tau)$ defined in (8) is
used to build the “best” tree list. For each positive
integer $k$, we try to find the subtree $\tau_k$ with the smallest
$I(\tau)$ among all subtrees in $C(k)$. A better definition of
the “best” is:

$$\frac{I(\tau_k)}{R(\tau_k)} = \min_{\tau \in C(k)} \frac{I(\tau)}{R(\tau)} \qquad (15)$$

where $C(k)$ is defined as in (9).

The roughness parameter $R(\tau)$ depends on the number
of terminal nodes. It is impossible to generate the
subtree in (15) directly. We make the list through a
two-step approach. First, we build the “best” tree list
according to the penalty function $P(\tau)$, although still
using the criterion in (15). Then the “best” tree list for
$R(\tau)$ can be obtained easily. Here $P(\tau)$ needs to be made
discrete as in (9).

4  Simulations 

Extensive simulations have been made to study the
performance of the tree method. Two aspects need to be
examined: first, how the method captures the features;
second, how it responds to the influence of
noises. Both contribute to the accuracy of estimation.
In this paper, only the results for the second aspect are
presented, due to the limitation on paper length.

Three types of simulations are designed. Each of the
underlying densities is a random composition of four
simpler densities: $f(x) = \sum_{i=1}^{4} \alpha_i g_i(x)$. Here
$(\alpha_1, \alpha_2, \alpha_3, \alpha_4)$
is a random point from the four-dimensional simplex. $f(x)$
is restricted to the unit hypercube $(0,1)^p$.

For the first type of simulation, each $g_i(x)$ is a
five-dimensional normal density $N(\mu_i, \Sigma_i)$: $\mu_i$ comes from
the uniform distribution $U(1/6, 5/6)^5$, and $\Sigma_i$ is a diagonal
matrix whose diagonal elements are from $U(0, 5/9)$. For each
data set generated from the first type of simulation, five
global noises are added to make the $g_i(x)$ of the
second type of simulation. The noises are from $U(0,1)^5$.

We consider local noises in the third type of simulation.
Each of the $g_i(x)$ contains five feature variables and
five noise variables. The noises are from $U(0,1)^5$ and the
features are from a five-dimensional normal density as
in the first type of simulation. Whether a variable is
noise or not is decided randomly.

Each type of simulation is repeated eight times. As
there are so many random factors in the design, these are
essentially 24 different simulations. The sample size is
500. A Monte Carlo sample of 3,000 points is used to
calculate the integrated square loss $\int (\hat f - f)^2$. The
results are compared with those from kernel estimation.
In the kernel estimation, different window sizes are used
for different variables. The best window sizes are
selected through a grid search, with window sizes ranging
from 0 to 1 in steps of 0.02. The simulation results are
presented in Table 1.
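
The Monte Carlo evaluation of the integrated square loss can be sketched as follows. The text does not say how the 3,000 evaluation points were drawn; the sketch assumes they are uniform on the unit cube, in which case the average squared difference estimates the integral directly. The two densities in the example are hypothetical and one-dimensional, chosen so that the exact answer (1/3) is known.

```python
import random


def monte_carlo_ise(f_hat, f_true, dim, n_points=3000, seed=0):
    """Estimate the integral of (f_hat - f_true)^2 over the unit cube.

    Assumes uniform evaluation points; the cube has volume 1, so the
    sample mean of the squared difference estimates the integral.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_points):
        u = [rng.random() for _ in range(dim)]
        total += (f_hat(u) - f_true(u)) ** 2
    return total / n_points


# Hypothetical example: true density 2x on (0, 1), estimate constant 1.
def true_density(u):
    return 2.0 * u[0]


def flat_estimate(u):
    return 1.0


ise = monte_carlo_ise(flat_estimate, true_density, dim=1)
```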

Table 1: Comparing Tree-based Estimation and Kernel Estimation

Column headings in the source: Simulation; Design 1, Design 2 and
Design 3, each with a kernel and a tree column. Several entries did
not survive extraction, so the remaining values are listed per
simulation in source order and are not aligned to columns.

Simulation   Surviving entries
1            3.145   1.309   2.830   0.807
2            5.986   19.000  6.102   5.086   3.050
3            0.816   1.864   0.852
4            1.635   0.782
5            0.927   2.768   0.886   1.829   1.082
6            2.195   3.969   2.195
7            4.797   14.125  7.475   1.928   1.132
8            0.633   2.014   1.115

In Design 1, there are no designed noises, so both
methods are trying to capture the features. Although
the tree estimate uses a step function to
estimate a smooth function, its accuracy is comparable
with the kernel's. In two of the situations, it is even
better. However, when global noises are added in Design 2,
the efficiency of kernel estimation drops dramatically,
while there is hardly any change in tree estimation. In
Design 3, a variable can be a feature in one place and a noise
in another. Still, tree estimation is much better.

5  Summary  and  Conclusions 

The information carried by a density function is how the
density is distributed, or how it differs from a uniform
distribution. If a variable is uniformly distributed and
is independent of the others, it should be considered
a noise. In a high-dimensional situation, many
variables may behave like local noises: they are
conditionally independent of the other variables and are uniformly
distributed in some subspaces. A good density
estimation method should be able to identify these noises while
capturing the real data features.

The tree-structured density estimation has the power
to identify these noises locally. The method is based
on the CART tree methodology. However, significant
changes have been made to adapt the methodology to
density estimation. These include defining a roughness
parameter and applying a tree optimizing approach. The
method is compared with the kernel method through
simulations. The simulations presented in this paper concern
only the influence of noises.

More simulations have been made to study how the method
captures the features of density functions in one- or
two-dimensional situations. Although the underlying
densities are continuous, the tree method still has
performance comparable to that of other standard
methods. In fact, it is often more accurate when the
underlying density is complex. These results will be
presented in a more complete paper.


Spear for his constant support and encouragement. His
comments and insights from the application perspective are
very enlightening and valuable.

Abstract: Rankings of n items are often constructed
from paired comparisons within the set (football
rankings, for example). Collections of paired comparisons,
however, are subject to inconsistencies: A > B,
B > C, C > A. Treating the set of all possible
outcomes of the $\binom{n}{2}$ paired comparisons of n items as
vertices on a hypercube, inconsistencies can be removed by
orthogonal decomposition. The resulting consistent
substructures lie imbedded in permutation polytopes.

1.  Introduction 

Given a set of n items, it is often desirable to be
able to assign a ranking to the set of items, indicating
first place, second place, and so on. A situation that
often arises is that the rankings of n items are not given
per se; rather, the data consist of the outcomes from
some subset of the $\binom{n}{2}$ possible paired comparisons, from
which a ranking must be inferred. For example, if the
three items A, B, C were to be compared in pairs, the
collection of outcomes A > B, A > C and B > C would
imply that item A should be ranked first, item B second,
and item C third (here and in the sequel, A > B means
that item A is preferred to item B).

A common problem that arises with paired comparisons
is that some collections of outcomes are internally
inconsistent, e.g. A > B, B > C and C > A. These
inconsistent triples were called circular triads by Kendall
and Babington Smith (1940). How to deal with these
inconsistent groupings is a matter of some dispute. In
most formulations of ranking models, inconsistent
collections of outcomes are simply ignored. However, some
information can occasionally be gleaned from inconsistent
collections. Given four items and the collection of
outcomes A > B, A > C, A > D and B > C, C > D,
D > B, it can be asserted that item A is preferred to the
other three, even though the relative ordering amongst
items B, C and D is not discernible.

2.  Permutation  Polytopes  as  Projections  of  Hypercubes

When a comparison of two items is conducted, the
outcome is binary: either item A is preferred to item
B, or item B is preferred to item A. Each paired
comparison can thus be thought of geometrically as defining
an axis, for example the AB axis, where the particular
outcome determines the value along that axis: 1 if (using
the example above) A is preferred to B, and -1 (-AB)
if B is preferred to A. In this manner, a space containing
the overall structure arising from a collection of paired
comparisons can be specified by the cartesian product of
these axes. The collections of outcomes of the $\binom{n}{2}$ paired
comparisons arising from the possible pairings of 2 out of
n items can be viewed as vertices on an $\binom{n}{2}$-dimensional
hypercube, with coordinates of either 1 or -1. These
vertices can then be thought of as points in $\mathbb{R}^{\binom{n}{2}}$. As an
example, consider the collections of paired comparisons
possible among three items, A, B, C. If (for purposes
of orientation) the axes defined are taken to correspond
to AB, AC and BC, respectively, then (1, 1, 1) would
indicate A > B, A > C, and B > C, (-1, -1, 1) would
indicate A < B, A < C and B > C, and so on. The
eight possible collections correspond to the vertices of a
cube in three dimensions.


To address the problem of inconsistent sets of
comparisons, consider the cube defined by the collections of
paired comparisons of three items. Of the eight vertices,
two correspond to triples which are linearly inconsistent:
(1, -1, 1) and (-1, 1, -1), using the AB-AC-BC coordinate
system as before. The other six vertices correspond
to the six possible rankings of three items. The two
inconsistent triples both lie along a single vector through
the origin, $\widehat{ABC} = (1, -1, 1)$. (In the sequel, the
notation $\widehat{IJK}$ will be used to indicate the vector
corresponding to an inconsistent arrangement of the three arbitrary
items I, J, K, specifically I > J, J > K and K > I.)
This vector defines an “inconsistent subspace”
associated with this set of paired comparisons; this in turn
suggests the existence of a “consistent subspace”. The
linear (ranking) information present within a collection
of paired comparisons can be viewed as a function of the
projection of the vector associated with that particular
vertex of the hypercube onto the consistent subspace.
This projection is illustrated in Figure 1.
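
The three-item case is small enough to enumerate. The sketch below lists the eight cube vertices in the (AB, AC, BC) system, flags a vertex as inconsistent when every item wins exactly one comparison, and removes the component along (1, -1, 1) to obtain the part lying in the consistent subspace. The helper names are illustrative only.

```python
from itertools import product


def wins(vertex):
    """Number of comparisons won by each item at a cube vertex."""
    ab, ac, bc = vertex
    w = {"A": 0, "B": 0, "C": 0}
    w["A" if ab == 1 else "B"] += 1
    w["A" if ac == 1 else "C"] += 1
    w["B" if bc == 1 else "C"] += 1
    return w


def is_inconsistent(vertex):
    """A circular triad: each item wins exactly one comparison."""
    return set(wins(vertex).values()) == {1}


def consistent_part(vertex):
    """Projection onto the orthogonal complement of (1, -1, 1)."""
    u = (1, -1, 1)
    coeff = sum(a * b for a, b in zip(vertex, u)) / 3  # u . u = 3
    return tuple(a - coeff * b for a, b in zip(vertex, u))


vertices = list(product((1, -1), repeat=3))
inconsistent = sorted(v for v in vertices if is_inconsistent(v))
```

The two inconsistent vertices project to the origin, while the six rankings keep a nonzero consistent part.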

To  establish  the  general  procedure,  we  need  to  show 
that  a  decomposition  into  inconsistent  and  consistent 
subspaces  is  always  feasible. 

As a first step, note that any triplet of items can give
rise to inconsistent pairings. There are $\binom{n}{3}$ item triplets,
defining an equal number of vectors corresponding to
inconsistencies. These vectors must be linearly dependent,
as $\binom{n}{3}$ grows faster than $\binom{n}{2}$ (the dimension of the
hypercube). Thus, it is necessary to establish the dimension
of the space spanned by the vectors corresponding to
these inconsistent triples.

Figure 1: Cube of Paired Comparisons of Three Items,
and the Associated Projection onto the Consistent
Subspace. Inconsistent Triples are Indicated by Dark
Circles. Coordinates are in the (AB, AC, BC) System.

Consider the vectors corresponding to the inconsistent
triples arising from the comparisons of four items.
These are shown in the rows of Table 1. An entry of 1 in
the table indicates that the specified pair was preferred
in the given order, a -1 indicates that the pair was
preferred in the reverse order, and a 0 indicates that no
direct pairing has occurred. As
$\widehat{BCD} = \widehat{ABC} - \widehat{ABD} + \widehat{ACD}$,
the vectors are linearly dependent. However, if
attention is constrained solely to the triples containing
a specific item (e.g. A), those vectors are not linearly
dependent.

Table 1: Vectors Associated with Inconsistent Triples
Arising from Paired Comparisons of Four Items.

Inconsistent                  Axis Labels
Triples               AB    AC    BC    AD    BD    CD
$\widehat{ABC}$        1    -1     1     0     0     0
$\widehat{ABD}$        1     0     0    -1     1     0
$\widehat{ACD}$        0     1     0    -1     0     1
$\widehat{BCD}$        0     0     1     0    -1     1

($\widehat{ABC}$ points to A > B, B > C, C > A.)

Lemma 1: The vectors corresponding to inconsistent
triples involving item A are linearly independent.

Proof: For any items I and J, the vector $\widehat{AIJ}$ has a 1 in
the entry corresponding to the IJ axis, and the vectors
corresponding to all other inconsistent triples containing
A have a 0. Thus, $\widehat{AIJ}$ cannot be formed as a linear
combination of such vectors.

Lemma 2: Any vector corresponding to an inconsistent
triple can be expressed as a linear combination of vectors
corresponding to inconsistent triples involving item A.

Proof: As the lemma is trivially true if the inconsistent
triple involves A, it suffices to show that it holds for an
arbitrary inconsistent triple I, J, K not involving A. This
is most easily shown using unit vectors corresponding to
the various paired comparison axes; e.g.,
$\widehat{ABC} = e_{AB} - e_{AC} + e_{BC}$.

$$\begin{aligned}
\widehat{IJK} &= e_{IJ} - e_{IK} + e_{JK} \\
&= e_{IJ} - e_{IK} + e_{JK} + (e_{AI} - e_{AI}) + (e_{AJ} - e_{AJ}) + (e_{AK} - e_{AK}) \\
&= (e_{AI} - e_{AJ} + e_{IJ}) - (e_{AI} - e_{AK} + e_{IK}) + (e_{AJ} - e_{AK} + e_{JK}) \\
&= \widehat{AIJ} - \widehat{AIK} + \widehat{AJK}.
\end{aligned}$$

Lemma 3: Any vector corresponding to an inconsistent
k-tuple (for example, $\widehat{ABCD} = e_{AB} + e_{BC} + e_{CD} - e_{AD}$
corresponds to an inconsistent 4-tuple) can be written
as a linear combination of vectors corresponding to
inconsistent triples.

Proof: The lemma holds trivially if k = 3, and
inconsistency is impossible if k < 3, so the lemma holds then
as well. If k > 3, the vector corresponding to the
inconsistent k-tuple can be written as the sum of a vector
corresponding to an inconsistent triple and a vector
corresponding to an inconsistent (k - 1)-tuple as follows:

$$\begin{aligned}
\widehat{IJKL \ldots N} &= e_{IJ} + e_{JK} + e_{KL} + \ldots - e_{IN} \\
&= (e_{IK} - e_{IK}) + e_{IJ} + e_{JK} + e_{KL} + \ldots - e_{IN} \\
&= (e_{IJ} - e_{IK} + e_{JK}) + e_{IK} + e_{KL} + \ldots - e_{IN} \\
&= \widehat{IJK} + \widehat{IKL \ldots N}.
\end{aligned}$$

The (k - 1)-tuple can then be reduced, and the lemma
follows by induction.

Theorem 1: Given the space defined by the $\binom{n}{2}$ axes
associated with the paired comparisons possible among
n items, the set of vectors corresponding to inconsistent
triples involving item A forms a basis for the inconsistent
subspace.

Proof: The theorem follows immediately from Lemmas
1-3.

Corollary 1.1: The dimension of the inconsistent
subspace is $\binom{n-1}{2}$.

Corollary 1.2: The consistent subspace exists and has
dimension $\binom{n}{2} - \binom{n-1}{2} = n - 1$.
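
The dimension count can be checked numerically for small n. The sketch below builds the vector of every circular triad I > J, J > K, K > I as $e_{IJ} - e_{IK} + e_{JK}$, computes the rank of the whole collection by exact Gaussian elimination, and compares it with $\binom{n-1}{2}$. Items are represented by the integers 0, ..., n - 1 and the axis order is whatever `itertools.combinations` produces; both are conveniences of this sketch and differ from the AB, AC, BC, AD, BD, CD order of Table 1.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def triple_vector(i, j, k, axes):
    """Vector e_ij - e_ik + e_jk for the circular triad on items i < j < k."""
    v = [0] * len(axes)
    v[axes[(i, j)]] = 1
    v[axes[(i, k)]] = -1
    v[axes[(j, k)]] = 1
    return v


def rank(rows):
    """Rank of an integer matrix via exact Gauss-Jordan elimination."""
    rows = [[Fraction(x) for x in r] for r in rows]
    r = 0
    for c in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(len(rows)):
            if i != r and rows[i][c] != 0:
                factor = rows[i][c] / rows[r][c]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
        if r == len(rows):
            break
    return r


def inconsistent_rank(n):
    """Dimension of the span of all circular-triad vectors for n items."""
    axes = {pair: idx for idx, pair in enumerate(combinations(range(n), 2))}
    vectors = [triple_vector(i, j, k, axes)
               for i, j, k in combinations(range(n), 3)]
    return rank(vectors)
```

For n = 4 the four triad vectors have rank 3, leaving a consistent subspace of dimension 6 - 3 = 3.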

Further, it can be shown (cf. Baggerly (1994)) that
if $\vec{I}$ is the vector corresponding to item I being ranked
first (item I beats all other items, and pairings of items
not including I do not occur), then the collection of
vectors $\{(\vec{I} + \alpha\vec{A})/\sqrt{n}\}$, where I is any item other than
A and $\alpha$ is a constant (its value is illegible in the
source), forms an orthonormal basis for the
consistent subspace.
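
Since the value of the constant is not legible above, a short derivation of what orthonormality requires may be useful. It writes $v_I$ for the vector of item I being ranked first (notation assumed for this sketch) and uses only two facts: $v_I$ has $n - 1$ nonzero entries, each $\pm 1$, and two such vectors overlap only on the IJ axis, where their entries have opposite signs. Which of the two roots the author uses cannot be recovered from the text.

```latex
\begin{align*}
v_I \cdot v_I &= n - 1, \qquad v_I \cdot v_J = -1 \quad (I \neq J), \\
\lVert v_I + \alpha v_A \rVert^2
  &= (n-1) - 2\alpha + (n-1)\alpha^2 = n
  \;\Longleftrightarrow\; (n-1)\alpha^2 - 2\alpha - 1 = 0, \\
(v_I + \alpha v_A) \cdot (v_J + \alpha v_A)
  &= -1 - 2\alpha + (n-1)\alpha^2 = 0
  \;\Longleftrightarrow\; (n-1)\alpha^2 - 2\alpha - 1 = 0, \\
\alpha &= \frac{1 \pm \sqrt{n}}{n - 1}.
\end{align*}
```

Both the unit-length condition and the orthogonality condition reduce to the same quadratic, so either root makes the $n - 1$ vectors $(v_I + \alpha v_A)/\sqrt{n}$ orthonormal.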

Putting this basis into use, the projection of the
six-dimensional hypercube arising from the pairwise
comparisons of 4 distinct items onto the corresponding
3-dimensional consistent subspace is shown in Figure 2.
The completely consistent sets of paired comparisons
correspond to rankings of the four items; these are
situated at the vertices of the resultant polytope. In terms
of cartesian coordinates, these vertices lie at the 24
permutations of (0, ±1, ±2). Those sets of paired
comparisons with only one inconsistent triple (slightly
inconsistent) are situated at the centers of the hexagonal faces
of the resultant polytope; these each have multiplicity 2
(i.e., two vertices of the initial hypercube map to each
such point). In terms of cartesian coordinates, these
vertices lie at the 8 permutations of (±1, ±1, ±1).
Finally, those sets of paired comparisons with two
inconsistent triples (grossly inconsistent) are situated behind
the square faces of the resultant polytope; these each
have multiplicity 4. In terms of cartesian coordinates,
these vertices lie at the 6 permutations of (0, 0, ±1). It
is impossible to have more than two inconsistent triples
arising in the paired comparisons of 4 items. The edges
defining the convex hull of these points are also shown.
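
The counts and multiplicities quoted above can be reproduced by enumeration. The sketch relies on one assumption that follows from the basis description: because the consistent subspace is spanned by the item vectors, the projection of a hypercube vertex is determined by the net scores (wins minus losses) of the items, and its squared length is the sum of squared net scores divided by n. Under that assumption the squared lengths 5, 3 and 1 match the coordinate families (0, ±1, ±2), (±1, ±1, ±1) and (0, 0, ±1).

```python
from collections import Counter
from itertools import combinations, product


def net_scores(outcome, pairs, n):
    """Wins minus losses of each item for one vertex of the hypercube."""
    s = [0] * n
    for (i, j), x in zip(pairs, outcome):
        s[i] += x   # x = 1 means item i is preferred to item j
        s[j] -= x
    return tuple(s)


n = 4
pairs = list(combinations(range(n), 2))

# Distinct projected points (keyed by net scores) with their multiplicities.
points = Counter(net_scores(v, pairs, n)
                 for v in product((1, -1), repeat=len(pairs)))

# Number of projected points for each (squared length, multiplicity) pair.
summary = Counter((sum(c * c for c in s) // n, mult)
                  for s, mult in points.items())
```

The enumeration gives 24 points of multiplicity 1, 8 of multiplicity 2 and 6 of multiplicity 4, which together account for all 64 vertices.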

Several features of this figure should be noted. First,
each edge of the initial hypercube has been shortened by
the same amount. This equal shortening follows from
the fact that each edge of the hypercube corresponds
to a shift along a single paired comparison axis and the
projection acts upon the axes in a symmetric manner.
Thus, the vertex labelled CBAD is the same distance
from the circled points as it is from the vertex labelled
CBDA. Second, the projections of the hypercube
vertices corresponding to full rankings of the items define
the convex hull of the projection of the hypercube onto
the consistent subspace. The polytope thus defined is a
truncated octahedron, having eight hexagonal faces and
six square faces, and is equivalent to the permutation
polytope associated with the rankings of four items.

3.  Permutation  Polytopes 

A permutation polytope is the convex hull defined
by the n! points in $\mathbb{R}^n$ whose coordinates are
permutations of the first n integers. Using these polytopes
to analyze ranked data was first suggested by Schulman
(1979); this procedure has recently been generalized and
expanded on by Thompson (1993).

Consider the rankings of four items, A, B, C, D, and
let $\pi_i$ be a vector in $\mathbb{R}^4$ whose coordinates are the ranks
of A, B, C, D, respectively. Thus, $\pi_i = (1, 2, 3, 4)$ would
correspond to the ordering (A, B, C, D): item A is ranked
first, item B second, item C third, and item D fourth.
Similarly, $\pi_i = (3, 4, 1, 2)$ would correspond to the
ordering (C, D, A, B): item A is ranked third, item B fourth,
item C first, and item D second.

As each vector has the same components,

$$\sum_{j=1}^{4} \pi_i(j) = 1 + 2 + 3 + 4 = 10,$$

a constant, so this polytope is constrained to lie in a
3-dimensional subspace of $\mathbb{R}^4$. Similarly, as the average of
the components, $\bar\pi$, is always the same,

$$\sum_{j=1}^{4} (\pi_i(j) - \bar\pi)^2 = 2.25 + .25 + .25 + 2.25 = 5,$$

another constant, so this polytope is constrained to lie
on a 4-dimensional hypersphere. These constraints
generalize to n dimensions, so a permutation polytope must
lie imbedded in an (n - 1)-dimensional hypersphere.

Two vertices of the permutation polytope are joined
by an edge if and only if they differ by a single
transposition of two consecutive integers: (3, 4, 1, 2) is joined to
(3, 4, 2, 1), (2, 4, 1, 3) and (4, 3, 1, 2). In terms of item
orderings, this transposition of consecutive integers
corresponds to a transposition of two adjacent items:
(C, D, A, B) is joined to (D, C, A, B), (C, A, D, B), and
(C, D, B, A).
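
The edge rule and the two constancy relations are easy to verify by enumeration. In the sketch below a vertex is a tuple of ranks for items A, B, C, D; `neighbours` swaps each pair of consecutive rank values, and `ordering` converts a rank vector into the corresponding item ordering. The function names are illustrative only.

```python
from itertools import permutations


def neighbours(ranks):
    """Rank vectors joined to `ranks` by an edge: swap the values r and r + 1."""
    result = []
    for r in range(1, len(ranks)):
        swapped = tuple(r + 1 if x == r else r if x == r + 1 else x
                        for x in ranks)
        result.append(swapped)
    return result


def ordering(ranks, items="ABCD"):
    """Items listed from first place to last place."""
    return tuple(item for _, item in sorted(zip(ranks, items)))


vertex = (3, 4, 1, 2)
joined = neighbours(vertex)
joined_orderings = [ordering(v) for v in joined]

# Every vertex has the same coordinate sum and the same squared
# distance from the centre (2.5, 2.5, 2.5, 2.5).
sums = {sum(p) for p in permutations((1, 2, 3, 4))}
squared_radii = {sum((x - 2.5) ** 2 for x in p)
                 for p in permutations((1, 2, 3, 4))}
```

Each vertex has exactly n - 1 = 3 neighbours, one for each pair of adjacent positions in the item ordering.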

The connection structure imparts a powerful property
to the permutation polytopes: every face on the
polytope has a direct interpretation in terms of
rankings.

This feature is illustrated in Figure 2, where eight of
the vertices have been labelled with their corresponding
item orderings. The labelled hexagon in the upper left
corresponds to the collection of all rankings in which
item C is ranked first. Similarly, three of the remaining
hexagonal faces correspond to items A, B and D being
ranked first, respectively. The other four hexagonal faces
correspond to a given item being ranked last. The six
square faces correspond to a given pair of items being
ranked in the first two positions; the labelled square in
front corresponds to items B and C being jointly ranked
first and second. Methods of determining what faces can
arise and assigning interpretations to them are provided
in Thompson (1993) and Baggerly (1994).

Figure 2: Projection of the 6-dimensional Hypercube Arising from the Paired
Comparisons of Four Items onto the Consistent Subspace.

Hence, the ranking information present in a
collection of paired comparisons can be inferred by noting
what face of the permutation polytope is indicated by
the projection of the collection onto the consistent
subspace. The inconsistent collections mapping to the
circled dot at the center of the hexagonal face in the upper
left of Figure 2 correspond to item C being ranked first,
while the relative ranking of items A, B and D is left
indeterminate. Similarly, the large circled dot behind
the square face in the center indicates a slight preference
for items B and C over items A and D. The fact that
the vertex is within the convex hull (as opposed to on
the surface) indicates that some ambiguity is present. In
general, the magnitude of the projection onto the
consistent subspace can be taken as an indicator of the strength
of the expressed preference.

4.  Future  Work

Several avenues remain to be explored. There are
questions of what to do if the collections of paired
comparisons are incomplete (in that not every comparison
has been made), or if some comparisons have been made
multiple times. It is not immediately clear how to scale
the projections to account for the missing information.
These problems have been addressed analytically by
Kendall (1955) and recently by Andrews and David
(1990). A good overview of much of the analytic work
done on paired comparisons is provided by David (1988).

If all comparisons have been made, but some ties
have resulted (yielding a 0 as opposed to a 1 or -1 in
the appropriate entry), the projection onto the
consistent subspace is still well-defined. Every full ranking can
be represented as a collection of paired comparisons; if
ties are allowed, every partial ranking (e.g., A first) can
also be represented by a collection. This fact suggests
new geometric ways of examining mixtures of full and
partially ranked data.

Finally,  there  may  exist  other  projections  of  the 
paired  comparison  hypercubes  which  may  be  of  interest, 
as  these  other  projections  can  potentially  reveal  trends 
in  exactly  how  inconsistencies  tend  to  occur. 

References 

Andrews, D. M. and David, H. A. (1990), "Nonparametric
Analysis of Unbalanced Paired-Comparison or
Ranked Data", Journal of the American Statistical
Association, 85, 1140-1146.

Baggerly, K. A. (1994), Visual Estimation of Structure
in Ranked Data, Ph.D. Thesis, Department of Statistics,
Rice University.

David, H. A. (1988), The Method of Paired Comparisons,
Charles Griffin & Company, Ltd., London; Oxford
University Press, New York.

Kendall, M. G. (1955), "Further Contributions to the
Theory of Paired Comparisons", Biometrics, 11, 43-62.

Kendall, M. G., and Babington Smith, B. (1940), "On
the Method of Paired Comparisons", Biometrika, 31,
324-345.

Schulman, R. S. (1979), "A Geometric Model of Rank
Correlation", The American Statistician, 33, 77-80.

Thompson, G. L. (1993), "Generalized Permutation
Polytopes and Exploratory Graphical Methods for
Ranked Data", Annals of Statistics, 21, 1401-1430.




On  the  Analysis  of  Multiple  Correlated  Binary  Endpoints  in  Medical  Studies 

Chung-Kuei  Chang  and  Dror  M.  Rom 

Department  of  Biostatistics,  Rhone-Poulenc  Rorer  Central  Research, 
Collegeville,  PA  19426,  U.S.A. 


ABSTRACT 

A new procedure is proposed for the analysis of multiple
correlated binary endpoints. The procedure is based on the
exact distribution of $-2\sum_i \log(p_i)$, where the $p_i$'s are
transformations of the statistics $z_i$'s used to test the
individual endpoints. We show how to make global, as
well as local, inferences regarding the hypotheses. We also
compare this approach with several recently proposed
multiple comparison procedures for the analysis of multiple
correlated binary endpoints in terms of Type I error control
and power.

Key words: endpoints, multiple comparison procedures,
familywise error rate.

INTRODUCTION 

Suppose there are c treatment groups, a control group and
c-1 treated groups with increasing doses, and k endpoints
with binary response are measured on each experimental
unit. For endpoint i, i = 1, 2, ..., k, we test the null
hypothesis $H_i$ of no treatment effect against the alternative
hypothesis that the response rate increases with dose. It is
well known that the Type I error for testing
$H_0 = \bigcap_{i=1}^{k} H_i$ could increase if we conclude that
there is a treatment effect by rejecting $H_0$ when observing
any significant result among the k endpoints. Several methods
are available in the literature to control the overall Type I
error. These include the Bonferroni procedure and its
improvements [see Holm (1979), Simes (1986), Hommel (1988),
Rom (1990)], and procedures taking discreteness into account
[see Brown and Fears (1981), Heyse and Rom (1988),
Westfall and Young (1989), Tarone (1990), Rom (1992)].
In this paper, we study the conditional exact test of Fisher's
combination procedure using $T = -2\sum_i \log(p_i)$ as test
statistic, where $p_i$ is the asymptotic p-value for testing the
i-th endpoint.

NOTATION 

Suppose that there is a total of r ($r \le 2^k$) different
combinations of binary responses (response vectors) from
the k endpoints, denoted by
$D = [d_{mi}]_{r \times k} = [d_1, d_2, \dots, d_r]'$,
where $d_{mi} = 1$ if the i-th endpoint in the m-th
combination has response; else $d_{mi} = 0$; m = 1, 2, ..., r,
i = 1, 2, ..., k. Let $n_m = [n_{m1}, n_{m2}, \dots, n_{mc}]$ be the number
of subjects in the c groups corresponding to the m-th
response vector $d_m$, m = 1, 2, ..., r; then our test procedure
is based on $G = [D \mid N_0]_{r \times (k+c)}$, where
$N_0 = [n_1, n_2, \dots, n_r]'$. Notice that the m-th row margin
$n_{m\cdot} = \sum_{j=1}^{c} n_{mj}$ is the total number of subjects
among the c groups that have response vector $d_m$, and the
j-th column margin $n_{\cdot j} = \sum_{m=1}^{r} n_{mj}$ is the size
of group j; m = 1, 2, ..., r; j = 1, 2, ..., c.


ANALYSIS  OF  INDIVIDUAL  ENDPOINTS 

For each endpoint i, a 2 x c table $E_i = [e_{ilj}]$ can be derived
from G, where the cell count of the j-th column in the first
(second) row is the number of subjects that do not respond
(respond) for endpoint i, i.e.,

$$e_{ilj} = \sum_{m=1}^{r} n_{mj}\, I_{\{l-1\}}(d_{mi}),$$

$$I_{\{l-1\}}(d_{mi}) = 1 \ \text{if } d_{mi} = l-1, \qquad = 0 \ \text{otherwise};$$

i = 1, 2, ..., k; l = 1, 2; j = 1, 2, ..., c.  (1)

The analysis for endpoint i can be done by Mantel's score
test using:

$$T_i = \sum_{l=1}^{2} \sum_{j=1}^{c} u_l\, v_j\, e_{ilj}, \quad \text{where } u_l = l-1 \text{ and } v_j = j-1. \qquad (2)$$

The column scores $v_j$'s reflect the progressive response at
increasing doses. For other possible values of the scores,
see Tukey, Ciminera and Heyse (1985). To test an upward
trend in the response rate, the asymptotic p-value is:

$$p_i = 1 - \Phi(Z_i), \quad \text{where } Z_i = \frac{T_i - E(T_i)}{\sqrt{\mathrm{Var}(T_i)}}. \qquad (3)$$
OVERALL TEST PROCEDURE

To test the overall null hypothesis $H_0$ of no treatment effect
at any endpoint, we propose to use:

$$T = -2 \sum_{i=1}^{k} \log(p_i), \qquad (4)$$

which is equivalent to using:

$$T' = \prod_{i=1}^{k} p_i. \qquad (5)$$

When the k endpoints are continuous and independent, T
has a chi-square distribution with 2k degrees of freedom, and
it is known as Fisher's combination procedure. In our case,
the overall p-value is calculated using the exact distribution
of T', conditional on the row and column margins of the
observed r x c table $N_0$. It is the probability of observing
any r x c table N, under the null hypothesis and conditional
on the margins of $N_0$, which is at least as extreme as $N_0$,
where the extremity is measured by T'. We denote the
overall p-value by adj-p, and express it as:

$$\text{adj-}p = \Pr(T' \le T'_0 \mid \text{margins of } N_0)
  = \sum_{T' \le T'_0} \Pr(N \mid \text{margins of } N_0), \qquad (6)$$

where $T'_0$ is the observed test statistic. Notice that the
dependence of the k endpoints is reflected by the row
margins and that N, conditional on the margins, follows a
multivariate hypergeometric distribution under the null
hypothesis.

Our algorithm is as follows:

1. Calculate the observed statistic $T'_0 = \prod_i p_i$, using D and
the observed table $N_0$.

2. Set adj-p to zero.

3. Enumerate r x c tables N satisfying the row and column
margins of $N_0$.

4. Use D and the enumerated table N to form k 2 x c tables
and calculate their asymptotic p-values.

5. Calculate the corresponding test statistic $T' = \prod_i p_i$.

6. If T' is less than or equal to $T'_0$, then adj-p is increased by
p, where p is the probability of observing the enumerated
table N, conditional on the margins.

7. Return to 3, until all possible (given the margins) r x c
tables are enumerated.


SOME  COMPUTING  ISSUES  IN  OUR  PROGRAM 

We have implemented our procedure in SAS using a
complete algorithm (see Verbeek and Kroonenberg, 1985)
to enumerate all the r x c tables satisfying the margins.
When large problems are encountered, we use Monte Carlo
simulation to estimate the adjusted p-value by taking
random samples from all possible tables (see Boyett, 1979).
Gail and Mantel's (1977) method to approximate the total
number of r x c tables satisfying the margins can help in
deciding between the exact and the Monte Carlo
procedures.
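
The Monte Carlo alternative can be illustrated with a short Python sketch for the two-group case. It is not Boyett's (1979) algorithm and not the authors' SAS program: it simply draws the treated group at random without replacement from the pooled subjects, which samples tables with the observed margins from the multivariate hypergeometric distribution. The function names, the seed, and the variance form inside `upper_p` (chosen because it reproduces the asymptotic p-values of Table 2) are assumptions; the counts are those of Table 1 in Example 1.

```python
import random
from math import erfc, sqrt


def upper_p(resp, total, n_treated, n_total):
    """One-sided asymptotic p-value 1 - Phi(z) for a 2 x 2 table."""
    mean = n_treated * total / n_total
    # Assumed variance form (n**3 in the denominator).
    var = (n_treated * (n_total - n_treated)
           * total * (n_total - total)) / n_total ** 3
    return 0.5 * erfc((resp - mean) / sqrt(2.0 * var))


def mc_adj_p(D, control, treated, n_samples=5000, seed=0):
    """Monte Carlo estimate of adj-p for two groups (c = 2)."""
    rng = random.Random(seed)
    row_tot = [a + b for a, b in zip(control, treated)]
    n_treated, n_total = sum(treated), sum(row_tot)
    k = len(D[0])
    # Responders per endpoint: fixed by the row margins.
    total_resp = [sum(m for d, m in zip(D, row_tot) if d[i]) for i in range(k)]
    obs_resp = [sum(x for d, x in zip(D, treated) if d[i]) for i in range(k)]

    def stat(resp):
        # T' = product of the asymptotic p-values.
        t = 1.0
        for r, tot in zip(resp, total_resp):
            t *= upper_p(r, tot, n_treated, n_total)
        return t

    t_obs = stat(obs_resp)
    # One response vector per subject, pooled over both groups.
    pool = [d for d, m in zip(D, row_tot) for _ in range(m)]
    hits = 0
    for _ in range(n_samples):
        drawn = rng.sample(pool, n_treated)  # random treated group
        resp = [sum(d[i] for d in drawn) for i in range(k)]
        if stat(resp) <= t_obs * (1 + 1e-12):
            hits += 1
    return hits / n_samples


# Rows: none, A only, B only, A and B (Table 1).
D = [(0, 0), (1, 0), (0, 1), (1, 1)]
estimate = mc_adj_p(D, [48, 1, 0, 1], [39, 6, 3, 2])
print(estimate)
```

With 5,000 samples the estimate should fall near the exact value of about 0.0190 reported in Example 1, with a standard error of roughly 0.002.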


EXAMPLE  1 

The following example is from Rom (1992). In a
carcinogenicity study, 100 mice were randomly assigned to
either control or treated groups, with 50 mice in each group.
Tumor incidence at sites A and B was observed for each
mouse. The experimental outcome is summarized in Table
1, where D is the 4 x 2 table under Endpoint 1 and
Endpoint 2, and $N_0$ is the 4 x 2 table under Group 1 and
Group 2. From Table 1, we can derive two 2 x 2 tables for
endpoints 1 and 2, and calculate the observed test statistic
$T'_0$ (see Table 2). Then, we start enumerating 4 x 2 tables
N satisfying the margins of $N_0$. From each enumerated
table N, two 2 x 2 tables are derived and the corresponding
test statistic T' is calculated. If $T' \le T'_0$, the overall p-value
adj-p is increased by the probability of observing N, under
the null hypothesis and conditional on the margins of $N_0$.
Part of the SAS output is shown in Table 3. There are 128
tables satisfying the margins, and the overall p-value is
about 0.0190.

Note that for this example, the Fisher's Exact test
statistic for site A (B) is the number of mice in the treated
group with tumor A (B), i.e., 8 (5).
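
The enumeration in this example can be sketched in a few lines of Python for the two-group case. This is an illustrative sketch, not the authors' SAS program: the function names are hypothetical, and the variance used in `upper_p` (with n cubed in the denominator) is an assumption chosen because it reproduces the asymptotic p-values 0.02275 and 0.04606 of Table 2.

```python
from itertools import product
from math import comb, erfc, sqrt


def upper_p(treated_resp, total_resp, n_treated, n_total):
    """One-sided asymptotic p-value 1 - Phi(z) for a 2 x 2 table."""
    mean = n_treated * total_resp / n_total
    # Assumed variance form (n**3 in the denominator).
    var = (n_treated * (n_total - n_treated)
           * total_resp * (n_total - total_resp)) / n_total ** 3
    z = (treated_resp - mean) / sqrt(var)
    return 0.5 * erfc(z / sqrt(2.0))


def product_stat(D, treated, row_tot):
    """T' = product over endpoints of the asymptotic p-values."""
    n_treated, n_total = sum(treated), sum(row_tot)
    stat = 1.0
    for i in range(len(D[0])):
        resp = sum(x for d, x in zip(D, treated) if d[i] == 1)
        total = sum(m for d, m in zip(D, row_tot) if d[i] == 1)
        stat *= upper_p(resp, total, n_treated, n_total)
    return stat


def exact_adj_p(D, control, treated):
    """Return (T'_0, adj-p, number of enumerated tables) for c = 2."""
    row_tot = [a + b for a, b in zip(control, treated)]
    n_treated, n_total = sum(treated), sum(row_tot)
    t_obs = product_stat(D, treated, row_tot)
    denom = comb(n_total, n_treated)
    adj_p, n_tables = 0.0, 0
    # With two groups, a table is determined by its treated column.
    for cand in product(*(range(m + 1) for m in row_tot)):
        if sum(cand) != n_treated:
            continue  # violates the column margin
        n_tables += 1
        if product_stat(D, cand, row_tot) <= t_obs * (1 + 1e-12):
            prob = 1  # hypergeometric numerator
            for m, x in zip(row_tot, cand):
                prob *= comb(m, x)
            adj_p += prob / denom
    return t_obs, adj_p, n_tables


# Table 1: rows are none, A only, B only, A and B.
D = [(0, 0), (1, 0), (0, 1), (1, 1)]
control = [48, 1, 0, 1]
treated = [39, 6, 3, 2]
t_obs, adj_p, n_tables = exact_adj_p(D, control, treated)
print(n_tables, round(t_obs, 6), round(adj_p, 4))
```

The sketch assumes every endpoint has at least one responder and one non-responder, so that the variance is positive. For the Table 1 counts it enumerates 128 tables, and the adjusted p-value should agree with the reported value of about 0.0190.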


Table 1. Summary of number of mice with tumors

                 D                          N_0
Tumor      Endpoint 1   Endpoint 2   Group 1     Group 2     Total
site       (Tumor A)    (Tumor B)    (Control)   (Treated)
None           0            0           48          39         87
A only         1            0            1           6          7
B only         0            1            0           3          3
A and B        1            1            1           2          3
Group size                              50          50        100

Table 2: 2 x 2 tables for A and B generated from Table 1

Tumor   Incidence   Control   Treated   Asymptotic p-value
A       No             48        42
        Yes             2         8         0.02275
B       No             49        45
        Yes             1         5         0.04606

T'_0 = 0.02275 x 0.04606 = 0.001048


Table 3: Part of SAS output for Example 1

Obs   n11  n21  n31    p1      p2      T'      p        adj-p     A    B
  1    50    0    0   .0004   .0058   .0000   .00005   .00005    10    6
  2    49    1    0   .0038   .0058   .0000   .00046   .00051     9    6
  3    49    0    1   .0004   .0461   .0000   .00020   .00071    10    5
  4    49    0    0   .0038   .0461   .0002   .00020   .00090     9    5
  5    48    2    0   .0228   .0058   .0001   .00173   .00264     8    6
  6    48    1    1   .0038   .0461   .0002   .00173   .00437     9    5
...
127    38    6    3   .9962   .9942   .9904   .00046   .01900     1    0
128    37    7    3   .9996   .9942   .9938   .00005   .01900     0    0

T'_0 = 0.001048

Note:

1. p1 is the asymptotic p-value for site A.

2. p2 is the asymptotic p-value for site B.

3. p is the probability of observing N.

4. A is the number of tumors at site A in the treated group.

5. B is the number of tumors at site B in the treated group.


MAKING LOCAL INFERENCES

When rejecting the global null hypothesis $H_0$, we conclude
that at least one of the $H_i$'s is false. The closure principle of
Marcus, Peritz and Gabriel (1976) can be employed to
make inferences on individual hypotheses. With two
endpoints, one can reject any individual hypothesis $H_i$, i =
1, 2, if $H_0 = H_1 \cap H_2$ is rejected at level $\alpha$, and $H_i$ is also
rejected at level $\alpha$ using the same procedure.

In our example, the hypothesis corresponding to tumor
site A has a corresponding p-value of 0.0458 (see Table 4).
Since both the global null hypothesis and this individual
hypothesis are rejected at level 0.05, we can conclude that
treatment causes an increase in the rate of tumor A. Note
that our procedure, when applied to one endpoint only, is
equivalent to Fisher's Exact test.
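
As a small illustration of this closure argument, the following Python sketch computes the one-sided Fisher exact p-values for the two sites from the counts of Example 1 and applies the two-endpoint closure rule. The function name is hypothetical, and the global p-value is simply taken as the 0.0190 reported above rather than recomputed.

```python
from math import comb


def fisher_upper_p(treated_resp, total_resp, n_treated, n_total):
    """One-sided Fisher exact p-value: Pr(X >= treated_resp)."""
    hi = min(total_resp, n_treated)
    denom = comb(n_total, n_treated)
    tail = sum(comb(total_resp, x)
               * comb(n_total - total_resp, n_treated - x)
               for x in range(treated_resp, hi + 1))
    return tail / denom


alpha = 0.05
global_p = 0.0190                           # overall adj-p from Example 1
p_site_a = fisher_upper_p(8, 10, 50, 100)   # 8 of the 10 tumor-A mice treated
p_site_b = fisher_upper_p(5, 6, 50, 100)    # 5 of the 6 tumor-B mice treated

# Closure principle with two endpoints: H_i is rejected only if both the
# global hypothesis and H_i itself are rejected at level alpha.
reject_a = global_p <= alpha and p_site_a <= alpha
reject_b = global_p <= alpha and p_site_b <= alpha
print(round(p_site_a, 4), reject_a, round(p_site_b, 4), reject_b)
```

The two tail probabilities correspond to the values .0458 and .1022 that accompany the margins of Table 4, so only the hypothesis for site A is rejected at level 0.05.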


Table 4: Joint distribution and rejection regions

                                  B
  A       0      1      2      3      4      5      6    Margin
  0     .0001  .0002  .0003  .0001  .0000  .0000  .0000   .0006
  1     .0005  .0019  .0028  .0017  .0003  .0000  .0000   .0072
  2     .0017  .0080  .0136  .0106  .0036  .0004  .0000   .0380
  3     .0035  .0182  .0366  .0353  .0164  .0031  .0001   .1131
  4     .0040  .0250  .0600  .0698  .0409  .0110  .0009   .2114
  5     .0026  .0212  .0622  .0872  .0622  .0213  .0026   .2593
  6     .0009  .0110  .0409  .0698  .0600  .0250  .0040   .2114
  7     .0001  .0031  .0164  .0353  .0366  .0182  .0035   .1131
  8     .0000  .0004  .0036  .0107  .0136  .0080  .0017   .0380
  9     .0000  .0000  .0003  .0017  .0028  .0019  .0005   .0072
 10     .0000  .0000  .0000  .0001  .0003  .0002  .0001   .0006
Margin  .0133  .0889  .2367  .3223  .2367  .0889  .0133

Tail probabilities: Pr(A >= 8) = .0458; Pr(B >= 5) = .1022.

Rejection regions (marked in the original table): ordered p-values (Rom);
product of p-values.


Table 5: Actual levels and p-values

Method          Rejection region                      Actual level   P-value
Bonferroni      P(1) <= 0.025                           0.02050       0.0916
                (i.e., A >= 9 or B = 6)
Heyse and       P(1) <= 0.0133                          0.02050       0.0568
Rom (1988)      (i.e., A >= 9 or B = 6)
Rom (1992)      {P(1) < 0.0458} or                      0.04226       0.0286
                {P(1) = 0.0458 and P(2) <= 0.3389}
Proposed        P1 P2 <= 0.004306                       0.04664       0.0190

Note:

1. P(1) and P(2) are the ordered p-values of the individual
endpoints, where P(1) <= P(2).

2. P1 and P2 are the asymptotic p-values.


COMPARISON  WITH  OTHER  PROCEDURES 

Using the above example, we compare our method with the
Bonferroni procedure and two other exact procedures:
the Heyse and Rom (1988) procedure, which uses the minimum
p-value of the k endpoints as test statistic, and the Rom (1992)
procedure, which uses the ordered p-values of the k endpoints as
test statistic. The joint distribution of the tumor incidence
at sites A and B in the treated group, as well as the rejection
regions of two exact procedures, are displayed in Table 4.
The joint distribution can be obtained from Table 3 by
summing up the hypergeometric probabilities of identical
tumor incidence at the two sites in the treated group, where
(A, B) = (8, 5) is the observed value. The p-values and
actual significance levels under $\alpha = 0.05$ are summarized
in Table 5. We can see that the proposed procedure has the
smallest p-value and the least conservative Type I error
control (actual level closest to, but not exceeding, 0.05).


CALCULATION  OF  POWER  FUNCTION 

Here we only discuss the case of two treatment groups with
two endpoints. The method can analogously be extended to
the general case. Assume $P_{ij}$ is the actual response rate of
the i-th endpoint in the j-th group and that $V_j$ is the
covariance between the two responses, where i, j = 1, 2. It
is straightforward to calculate the probabilities of observing
the four possible outcomes of the two endpoints from the
given configuration. The result is shown in Table 6.
Notice that $V_j$ must satisfy the following inequalities to
ensure non-negative probabilities:

$$V_{jl} \le V_j \le V_{ju}, \quad \text{where}$$

$$V_{jl} = \max\{-[P_{1j}P_{2j}(1-P_{1j})(1-P_{2j})]^{1/2},\ -P_{1j}P_{2j},\ -(1-P_{1j})(1-P_{2j})\}, \quad \text{and}$$

$$V_{ju} = \min\{[P_{1j}P_{2j}(1-P_{1j})(1-P_{2j})]^{1/2},\ P_{1j}(1-P_{2j}),\ P_{2j}(1-P_{1j})\}. \qquad (7)$$
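
The bounds in (7) and the four multinomial cell probabilities of Table 6 are easy to check numerically. The following Python sketch uses hypothetical function names and evaluates them for the configuration of Table 7.

```python
from math import sqrt


def cov_bounds(p1, p2):
    """Lower and upper bounds on the covariance V, following (7)."""
    root = sqrt(p1 * p2 * (1 - p1) * (1 - p2))
    lower = max(-root, -p1 * p2, -(1 - p1) * (1 - p2))
    upper = min(root, p1 * (1 - p2), p2 * (1 - p1))
    return lower, upper


def cell_probs(p1, p2, v):
    """Probabilities of (none, A only, B only, A and B), as in Table 6."""
    lower, upper = cov_bounds(p1, p2)
    if not lower - 1e-12 <= v <= upper + 1e-12:
        raise ValueError("covariance outside the admissible range")
    return [(1 - p1) * (1 - p2) + v,
            p1 * (1 - p2) - v,
            (1 - p1) * p2 - v,
            p1 * p2 + v]


# Configuration of Table 7: response rates and covariance per group.
control = cell_probs(0.04, 0.02, 0.0192)
treated = cell_probs(0.16, 0.10, 0.024)
print([round(p, 4) for p in control], [round(p, 4) for p in treated])
```

For the control group the covariance 0.0192 sits exactly on the upper bound $P_{21}(1-P_{11})$, which is why the "B only" cell has probability zero in Table 7.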

We assume that the two treatment groups are independent
and that the group sizes are known; then we have two
independent multinomial distributions. A 4 x 2 table can
be formed by taking an observation from each multinomial
distribution. It is possible to observe 4 x 2 tables that have
a zero row margin in one or both of the corresponding 2 x 2
tables. If only one of the observed 2 x 2 tables contains a
zero row margin, the test statistic is defined as the
asymptotic p-value of the other 2 x 2 table. If both 2 x 2
tables contain a zero row margin, the test cannot be done
because the individual p-values are not defined. We define
such 4 x 2 tables as non-testable.

For the given response rates and covariances of the two
responses, the power is defined as the probability of
observing 4 x 2 tables, after excluding the non-testable
tables, on which the overall null hypothesis can be rejected
by our test procedure. The exact power can be calculated
by exhausting all possible outcomes of the two multinomial
distributions and by summing up the product of the two
multinomial probabilities over the 4 x 2 tables that can be
rejected under a pre-determined level $\alpha$, then
dividing by the probability of observing testable tables. That
is,

$$\text{Power} = \frac{\sum_{\text{adj-}p \le \alpha} \Pr\nolimits_1 \Pr\nolimits_2}{1 - \Pr(\text{non-testable } 4 \times 2 \text{ tables})}, \qquad (8)$$

where $\Pr_j$ is the multinomial probability of observing the
outcomes of group j, j = 1, 2.

The  group  sizes  are  sometimes  so  large  that  calculation 
of  the  exact  power  becomes  infeasible.  The  power  can  be 
estimated  by  taking  random  samples  from  the  two 
independent  multinomial  distributions  and  by  calculating 
the  proportion  of  rejected  tables,  after  deleting  non-testable 
4x2  tables. 
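
A minimal Python sketch of this simulation for two groups and two endpoints is given below. It is not the authors' program: the function names, seed, and sample count are hypothetical, the variance form inside `stat` is an assumption (it reproduces the asymptotic p-values of Table 2), and the cell probabilities are those of Table 7. An endpoint with a zero row margin is dropped from the product, and tables in which both endpoints are dropped are treated as non-testable.

```python
import random
from itertools import product
from math import comb, erfc, sqrt

D = [(0, 0), (1, 0), (0, 1), (1, 1)]  # none, A only, B only, A and B


def stat(cand, row_tot, n_treated, n_total):
    """Product of asymptotic p-values over testable endpoints, or None."""
    t, used = 1.0, 0
    for i in range(2):
        total = sum(m for d, m in zip(D, row_tot) if d[i])
        if total in (0, n_total):
            continue  # zero row margin: endpoint i is not testable
        resp = sum(x for d, x in zip(D, cand) if d[i])
        mean = n_treated * total / n_total
        var = (n_treated * (n_total - n_treated)
               * total * (n_total - total)) / n_total ** 3  # assumed form
        t *= 0.5 * erfc((resp - mean) / sqrt(2.0 * var))
        used += 1
    return t if used else None


def adj_p(control, treated):
    """Exact conditional p-value for one 4 x 2 table (None if non-testable)."""
    row_tot = [a + b for a, b in zip(control, treated)]
    n_treated, n_total = sum(treated), sum(row_tot)
    t_obs = stat(treated, row_tot, n_treated, n_total)
    if t_obs is None:
        return None
    total = 0
    # Enumerate the treated column; its first entry follows from the margin.
    for rest in product(*(range(m + 1) for m in row_tot[1:])):
        first = n_treated - sum(rest)
        if not 0 <= first <= row_tot[0]:
            continue
        cand = (first,) + rest
        if stat(cand, row_tot, n_treated, n_total) <= t_obs * (1 + 1e-12):
            w = 1
            for m, x in zip(row_tot, cand):
                w *= comb(m, x)
            total += w
    return total / comb(n_total, n_treated)


def mc_power(p_control, p_treated, n=50, alpha=0.05, n_samples=300, seed=1):
    """Proportion of testable simulated tables with adj-p <= alpha."""
    rng = random.Random(seed)
    rejected = testable = 0
    for _ in range(n_samples):
        cols = []
        for probs in (p_control, p_treated):
            draws = rng.choices(range(4), weights=probs, k=n)
            cols.append([draws.count(m) for m in range(4)])
        p = adj_p(cols[0], cols[1])
        if p is None:
            continue  # non-testable table is deleted
        testable += 1
        rejected += p <= alpha
    return rejected / testable


power = mc_power([0.96, 0.02, 0.00, 0.02], [0.78, 0.12, 0.06, 0.04])
print(round(power, 3))
```

With only 300 samples the estimate is rough (standard error near 0.03); increasing `n_samples` narrows it at a proportional cost in run time.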


Table 6: Configurations and the corresponding
multinomial probabilities

Tumor     Control    Treated
A         P11        P12
B         P21        P22
Cov       V1         V2

Tumor     Group 1                   Group 2
No        (1-P11)(1-P21) + V1       (1-P12)(1-P22) + V2
A only    P11(1-P21) - V1           P12(1-P22) - V2
B only    (1-P11)P21 - V1           (1-P12)P22 - V2
A and B   P11 P21 + V1              P12 P22 + V2


Table 7: Configurations and the corresponding
multinomial probabilities

Tumor         Control    Treated
A             0.04       0.16
B             0.02       0.10
Cov           0.0192     0.024
Correlation   0.70       0.22

Tumor         Group 1    Group 2
No            0.96       0.78
A only        0.02       0.12
B only        0.00       0.06
A and B       0.02       0.04
Example 1 (continued)

Suppose the tumor incidence rates at sites A and B and their
correlations are given in Table 7. Following Table 6, the
marginal probabilities of the two independent multinomial
distributions can easily be derived. Notice that the
expectation for each combination, using a group size of 50, is
exactly the one observed in Table 1. The powers under
level 0.05 using Rom (1992) and the proposed procedures
are estimated by Monte Carlo simulation with 5,000
samples. The results are displayed in Table 8. Both
procedures have similar power in this example. Running
on a VAX 6000-620, the CPU time used by the proposed
procedure is only about 5 minutes, in contrast to 1 hour and
44 minutes used by Rom's procedure.


Table 8: Powers of Rom (1992) and the proposed
procedures

Method      Power    Confidence Interval    # of samples
Rom         0.697    (0.685, 0.710)         5,000
Proposed    0.699    (0.687, 0.712)         5,000


CONCLUDING  REMARK 

Our proposed procedure can easily be extended to ordered
multinomial response by evaluating the asymptotic p-values
of the $r_i$ x c tables instead of 2 x c tables, where $r_i$ is the
number of possible outcomes at the i-th endpoint.

Although the powers of the proposed procedure under
different configurations are not reported here, from our
experience, the procedure has the best power when the
asymptotic p-values are positively correlated, or if several
(all) endpoints are affected by the treatment.

Abstract

The number of calculations in the classical Cross
Validation (CV) method grows very fast with the
size of the sample. Hence various methods reducing
the number of necessary calculations have been
proposed: a Monte Carlo Cross Validation approximation
[4], [5]; WARP-ing [10], [17]; and Binning
[9]. In the present paper we discuss a new approach,
Partial Cross Validation (PCV), saving on the
computational effort while choosing the optimal smoothing
parameter. In classical CV and in Generalized
Cross Validation (GCV) it is necessary to calculate
the sum of n squares of differences between $Y_i$ and
the leave-one-out estimates of the regression function
at $X_i$. In PCV this sum is calculated only over
a relatively small number $k_n$ of properly chosen
indices i. By choosing the PCV-optimal window width we
end up with both window width and estimator very
close to their GCV competitors. In Section 4 we present
the performance of the PCV and GCV methods in
simulations with n = 100 and $k_n$ = 8. In Section 3
we find conditions under which PCV has the same
feature as GCV: it is, up to a constant, an unbiased
estimator of the Mean Integrated Square Error
(MISE).

1 Introduction

One of the simplest representations of the regression
function is given by

$$Y = r(X) + \varepsilon, \qquad (1)$$

where X and $\varepsilon$ are independent, $E(\varepsilon) = 0$, and
$\mathrm{Var}(\varepsilon) = \sigma^2 < \infty$. In case our information about
$r(\cdot)$ is poor, e.g. when we know no adequate parametric
model to which $r(\cdot)$ belongs, nonparametric
methods provide reliable estimation tools. Nonparametric
estimators of r(x) based on i.i.d. observations
$(X_i, Y_i)$, i = 1, ..., n, considered in the literature
include Nadaraya-Watson, k-th Nearest Neighbor,
p-th Optimal Quantile, spline estimators, Gasser-Müller,
LOESS, Local Polynomial, Local Parametric,
and we refer for more comprehensive references to
[3], [8], [11], [12], and [15].

Users of nonparametric methods must pay for the
universal consistency with a slower rate of convergence
and, so far, with much greater computational
complexity. Methods called Randomized Cross Validation
[4], [5], WARP-ing [10], [17], and Binning [9]
considerably reduce (from $O(n^2)$ to $O(n)$) the costs
of calculation of the estimator, which are related
to the necessity of multiple evaluation of the kernel.
In the present paper we consider an application of
Numerical Analysis to the estimation of the optimal
smoothing parameter of nonparametric regression
estimators. We propose PCV, a modification
of CV and GCV (see [1]), by appropriately skipping
over most of the terms in the original formula and
weighting the remaining ones. The main idea consists
in approximating an integral (Integrated Square
Error (ISE)) by using some of the standard methods
available in the Numerical Integration Theory and
then approximating the knots of integration by the
closest points from the sample. This approach seems
applicable also in the multivariate case, in nonparametric
density estimation, in spline estimation, and
in tomography [4], [5], [14] as well. We shall not pursue
generality here and concentrate on presenting
the method in the case of the Nadaraya-Watson estimator.
It is clear that especially Binning can also
be incorporated into the methodology. However, for
the sake of simplicity of presentation we shall refer
here directly to kernels. The idea of PCV has been,
to our knowledge, first implemented in tomography
[14] in the version of approximation of order one (see
the rectangular version of the PCV listed at the end
of Section 2). The experience shows that it works
reasonably well, at least for small sample sizes.

In Section 3 we show that for higher order ISE
approximations $PCV_n(h)$ is asymptotically an unbiased
estimator of $MISE_n(h)$, see Theorem 2. We
implemented both versions of PCV in the nonparametric
and user friendly Fit Short 2.4 package, cf. [15].
A comparison of the performance of $GCV_n(h)$ and
$PCV_n(h)$ in simulations is reported in Section 4.


2 The Partial Cross Validation

Consider the Nadaraya-Watson estimator of r(x)
which is given by

$$\hat r_h(x) = \frac{\sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right) Y_i}{\sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right)}. \qquad (2)$$

Estimator $\hat r_h(\cdot)$ depends on a smoothing parameter
h called a window width. The quality of the estimator
strongly depends on the proper choice of the
window width and is measured by the Mean Integrated
Squared Error (MISE) given by

$$MISE_n(h) = E\{ISE_n(h)\}, \qquad (3)$$

where

$$ISE_n(h) = \int (\hat r_h(x) - r(x))^2\, w(x) f(x)\, dx, \qquad (4)$$

$f(\cdot)$ is the density function of X, and $w(\cdot)$ is the
indicator function of an interval A, such that

$$\gamma \le f(x) \le \frac{1}{\gamma} \quad \text{for some } \gamma > 0 \text{ and for every } x \in A.$$

A related random measure of discrepancy between
$\hat r_h(x)$ and r(x) is given by the Averaged Squared
Error (ASE)

$$ASE_n(h) = \frac{1}{n} \sum_{j=1}^{n} (\hat r_h(X_j) - r(X_j))^2\, w(X_j).$$

ASE and ISE have been proved asymptotically equivalent
to MISE [6], [7], [13], [10], [16], and the problem
consists in finding $h = h(X_1, \dots, X_n)$ minimizing
any of them. Despite a range of competitors
(cf. [8]), CV-type methods are among the most
popular in finding asymptotically optimal h. The
original $CV_n(h)$ is given by

$$CV_n(h) = \frac{1}{n} \sum_{j=1}^{n} (Y_j - \hat r_h^{(j)}(X_j))^2, \qquad (5)$$

where $\hat r_h^{(j)}(x)$ is the leave-one-out estimator. CV
admits some generalizations; here we shall refer to the
GCV in the form discussed in [11], [12]:

$$GCV_n(h) = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat r_h(X_i))^2 \cdot \Xi_i \cdot w(X_i), \qquad (6)$$

where

$$\Xi_i = [\dots], \qquad \Xi(u) = 1 + 2u + O(u^2), \quad (u \to 0).$$

PCV, a simple modification of the GCV, has
empirically determined computational complexity $O(n)$
(as implemented in the package Fit Short 2.4). Let

$$PCV_n(h) = \sum_{i=1}^{k_n} [\dots], \qquad (7)$$

where $k_n$ is the number of components, $\mu_i$ are the
weights, and $i^*$ is a function of argument i from
$\{1, \dots, k_n\}$ into $\{1, \dots, n\}$.

We shall need the following definitions and notation
(cf. [2], pp. 57 and 75).

Definition 1 A numerical integration method of
the form

$$\int_a^b u(x)\, dx = \sum_{i=1}^{m} u(x_i) \cdot w_i + R_m(u) \qquad (8)$$

is said to be of order s in a class of functions $\mathcal{F}$,
s+1 times differentiable on A, if $R_m(u) = O(m^{-s})$
for every $u \in \mathcal{F}$.

Definition 2 A numerical integration method is
called a compound $k_n$-points Gauss rule with $k_n =
m \cdot p$ if it results from dividing the interval of
integration into m equal subintervals and applying
the p-point Gauss method to each of them.


In applying any numerical integration rule to
approximate an expected value of g(X) we shall use
the representation

$$E_F\, g(X) = \int_0^1 g(F^{-1}(x))\, dx \qquad (9)$$

and apply the numerical integration rule to the right
hand side expression or, equivalently, transform the
original knots $x_i$ into the corresponding quantiles
of the probability distribution function F. In what
follows we shall assume that all necessary regularity
and smoothness assumptions required in theorems
in [2] on pp. 57 and 75 are fulfilled.
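
A small Python sketch may help to make representation (9) and Definition 2 concrete. It builds a compound rule from the two-point Gauss-Legendre rule (nodes at plus and minus $1/\sqrt{3}$, unit weights) on m equal subintervals of [0, 1] and applies it to $g(F^{-1}(u))$. The function names and the test distribution are hypothetical choices made for illustration only.

```python
from math import sqrt

# Two-point Gauss-Legendre rule on [-1, 1]: nodes +-1/sqrt(3), weights 1.
GAUSS2_NODES = (-1.0 / sqrt(3.0), 1.0 / sqrt(3.0))
GAUSS2_WEIGHTS = (1.0, 1.0)


def compound_gauss_knots(m):
    """Knots and weights on [0, 1]: m equal subintervals, 2 points each."""
    knots, weights = [], []
    width = 1.0 / m
    for s in range(m):
        mid = (s + 0.5) * width
        for x, w in zip(GAUSS2_NODES, GAUSS2_WEIGHTS):
            knots.append(mid + 0.5 * width * x)   # map [-1, 1] to subinterval
            weights.append(0.5 * width * w)
    return knots, weights


def expected_value(g, quantile, m=4):
    """Approximate E g(X) as the integral over (0, 1) of g(F^{-1}(u)) du."""
    knots, weights = compound_gauss_knots(m)
    return sum(w * g(quantile(u)) for u, w in zip(knots, weights))


# Hypothetical check: X = U**2 with U uniform on [0, 1], so that
# F^{-1}(u) = u**2 and E X = 1/3; the integrand is a quadratic polynomial.
approx = expected_value(lambda x: x, lambda u: u * u, m=4)
print(approx)
```

Because the two-point Gauss rule integrates cubic polynomials exactly, the quadratic integrand in this check is recovered up to rounding error. In the PCV setting the quantile function is unknown, which is why the knots are replaced by sample order statistics.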

Let $(X_1, Y_1), \dots, (X_n, Y_n)$ be a given sample of
independent pairs of random variables. We order
them according to the increasing values of X's
(with ties broken by the chronological order) getting
$(X_{(1)}, Y_{(1)}), \dots, (X_{(n)}, Y_{(n)})$ with
$X_{(1)} \le X_{(2)} \le \dots \le X_{(n)}$. Let (j) denote the index in
the original sample corresponding to the j-th order
statistic and [a] be the best integer approximation of
the real number a, rounding towards zero in case of
ambiguity.

Below we list two versions we used in simulations.
The first one corresponds to the 'rectangular' rule of
numerical integration. We do not know if Theorem
2 holds true for this numerical integration rule. The
second one corresponds to the compound $k_n$-points
Gauss rule of order 2p with $k_n = m_n \cdot p$,
and we will show in Section 3 that it is, up to a
constant, an unbiased estimator of MISE.

1. a rectangular version of the PCV: $k_n \approx c \cdot
\log(n)$, $\mu_i = 1/k_n$ for every i, and $i^* = ([\dots])$,

2. a compound Gaussian version of the PCV: $k_n =
m_n \cdot p$, $p \ge 1$, $\mu_i$ are the weights in the
$k_n$-points compound Gauss numerical integration
method, $i^* = ([\frac{n}{2} + \frac{n}{2} \cdot x_i])$, $i = 1, \dots, k_n$,
and $x_i$'s are the i-th ordered abscissas of the
$k_n$ knots of the compound Gauss integration rule
on [-1, 1].
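
To give a feel for the computational saving, here is a minimal Python sketch in the spirit of the rectangular version. It is not the Fit Short implementation and does not reproduce formula (7): it uses leave-one-out residuals of a Nadaraya-Watson estimator with a Gaussian kernel, equal weights $1/k$, and an evenly spaced choice of order statistics. The kernel, the index map, the bandwidth grid, and all function names are assumptions made for illustration; the simulated curve is the Chebyshev polynomial $8x^4 - 8x^2 + 1$ with noise standard deviation 0.7.

```python
import random
from math import exp


def gauss_kernel(u):
    return exp(-0.5 * u * u)


def loo_nw(xs, ys, j, h):
    """Leave-one-out Nadaraya-Watson estimate at xs[j]."""
    num = den = 0.0
    for i, (x, y) in enumerate(zip(xs, ys)):
        if i != j:
            k = gauss_kernel((xs[j] - x) / h)
            num += k * y
            den += k
    return num / den


def cv(xs, ys, h):
    """Classical leave-one-out CV score: n squared residuals."""
    n = len(xs)
    return sum((ys[j] - loo_nw(xs, ys, j, h)) ** 2 for j in range(n)) / n


def partial_cv(xs, ys, h, k):
    """Equal-weight score over k evenly spaced order statistics (assumed map)."""
    n = len(xs)
    order = sorted(range(n), key=lambda i: xs[i])  # index of j-th order statistic
    picks = [order[min(n - 1, round((i + 0.5) * n / k - 0.5))] for i in range(k)]
    return sum((ys[j] - loo_nw(xs, ys, j, h)) ** 2 for j in picks) / k


rng = random.Random(0)
n = 100
xs = [rng.random() for _ in range(n)]
ys = [8 * x ** 4 - 8 * x ** 2 + 1 + rng.gauss(0.0, 0.7) for x in xs]
grid = [0.02 * g for g in range(1, 16)]          # candidate window widths
h_cv = min(grid, key=lambda h: cv(xs, ys, h))
h_pcv = min(grid, key=lambda h: partial_cv(xs, ys, h, 8))
print(h_cv, h_pcv)
```

With k = 8 the partial score needs 8 estimator evaluations per window width instead of 100. When k equals n the index map selects every observation once, so the partial score coincides with the full CV score.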

3  Main  results 

The ISE given by (4) is an integral over interval A,
on which the density function f(x) is positive. Hence
for large n and smooth kernel K the estimator $\hat r_h(x)$
is well defined and smooth. So, if the regression
function is also smooth, the integrand of the ISE is a
smooth function and the integral can be approximated
with the use of numerical methods of integration.
We shall pursue this program while paying attention
to retain the proper order of approximation.

Let  us  approximate  ISE  =  ISEn  uniformly  for 
h  £  H  —  h  =  for  some  C  >  0  and 

5  >  0  using  a  compound  fen, -point  integration  me¬ 
thod  of  order  Sy  kn  =  p-rrin,  and  then  approximate 
the  knots  Xi  by  corresponding  the  closest  sample  po¬ 
ints  X  + .  We  have 

t 

ISEnih)  =  J  g{x)w{x)f{x)dx 

-I 

=  '^1^3  ■9ixj)  +  Opi—y 

=  PISEnih)  +  Op(— )*  +  Opi^) 
TTln  n 

The first equality follows just from the property of
the numerical integration method, see [2] pp. 58 and
75, while the second one is implied e.g. by the Bahadur
representation of quantiles [18]. To minimize


the  order  of  the  error  of  approximation  we  choose 
mn  =  yielding  the  error  of  order  To 

keep  the  error  on  level  o(n~  *)  we  take  s  >  4.  So,  we 
get 

Theorem 1 If the estimator $\hat r_h(x)$ and the regression
function r(x) belong to $\mathcal{F}$ and the integration
method is of order s on $\mathcal{F}$, then for $s \ge 4$ we have
for $h \in H$

ISEnih)  =  PISEnih)  +  Opin-‘).  (10) 

In the next step we consider relations between PISE
and PCV assuming the same $m_n$ in both cases and
also the same index selection method $i^*$. Indeed we
have

PISEnih)  =  ^  fi.*  [ih  -  r)") 

jr=l 

i=i 

+  ^M3e%-SiX*)-wi^*) 

m„ 

-h  2  ^  -  f/,iX,*)j-SiX^*)-wiX^*) 

3=1  ^  ^ 

mn 

=  PCVnih)-\-'y^lij-e\-wiX*) 

3=1  ’  ^ 

m. 

+  2  V'  —  ffc(X.*))-u;(X.*) 

.  J  \  J  3  '  3 

j=l 

+  -  £  tij-e^^:-wiX  *)+Op  (n-j) 

=  PCVnih)+Ti+T2+T3+Op  («-»)  . 

$T_1$ does not depend on h, while in a way similar to
[12], pp. 154-155, one can verify that

EiT2\Xi,...,Xn)  =  -n.  (11) 

Hence  we  get  the  following  theorem. 

Theorem 2 Under the assumptions of Theorem 1,
$PCV_n(h)$ given by (7) is, up to a constant, an unbiased
estimator of $MISE_n(h)$, i.e. for $h \in H$,

EPCVnih)  =  M  ISEnih) 

(mn 

^  j* , 

+  o(n-^/«) 


(12) 
Theorem 2 suggests that arguments $h_n$ minimizing
$PCV_n$ can be used in (2) instead of those minimizing
$GCV_n$. It is plausible that, paralleling arguments
in [6] or [13], one can show optimality of
the $h_n$ minimizing $PCV_n(h)$ in the sense of minimizing
$MISE_n(h)$. However, detailed verification of
this conjecture is beyond the scope of the present
note.


4  Simulations 

Using the package Fit Short 2.42 we compared PCV and
GCV on many simulated and real data sets. In
general, the behavior of PCV is on the level of
CV and GCV, very often with only minor differences
between the estimators from these methods. We briefly report
here on two typical simulations. In both cases we
applied the 8-point Gauss integration method with n =
100,  m„  =  1,  and  =  p  =  8. 

1. $r(x) = (\sin(2\pi x^3))^3$, X's uniform on [0, 1],
$\varepsilon \sim N(0, \sigma = 0.7)$, see Figures 1 and 2,

2. $r(x) = T_4(x) = 8x^4 - 8x^2 + 1$,
X's uniform on [0, 1], $\varepsilon \sim N(0, \sigma = 0.7)$, see
Figures 3 and 4.
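
The two simulation settings can be reproduced along the following lines. This is only a minimal sketch: the function names, the seed, and the use of Python's standard random generator are assumptions made for illustration and are not part of the original study.

```python
import math
import random

def r_sine(x):
    """First test function: (sin(2*pi*x^3))^3."""
    return math.sin(2.0 * math.pi * x ** 3) ** 3

def r_cheb(x):
    """Second test function: Chebyshev polynomial T_4(x) = 8x^4 - 8x^2 + 1."""
    return 8.0 * x ** 4 - 8.0 * x ** 2 + 1.0

def simulate(r, n=100, sigma=0.7, seed=0):
    """Draw n design points uniform on [0, 1] and add N(0, sigma^2) noise.

    The seed is a hypothetical choice that only makes the sketch repeatable.
    """
    rng = random.Random(seed)
    xs = sorted(rng.random() for _ in range(n))
    ys = [r(x) + rng.gauss(0.0, sigma) for x in xs]
    return xs, ys

xs, ys = simulate(r_sine)
xs2, ys2 = simulate(r_cheb)
```

Any bandwidth selector can then be run on `(xs, ys)` and `(xs2, ys2)`.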

In the former case we have almost identical resulting
estimators of the regression curves; in the latter
one $h_{PCV}$ oversmooths the regression curve. The
regression function in 1 was considered by Härdle in
[12] for n = 256 and $\sigma = 0.5$.


Abstract

For  the  estimation  of  additive  regression  models  in  a  nonparametric  fashion  based  on  some  linear  scatterplot  smoother 
the  solution  of  large  linear,  often  ill-posed,  systems  is  required.  Standard  iterative  approaches  of  the  Jacobi  and  Gauss- 
Seidel  type  only  apply  to  non-singular  system  matrices,  although  their  use  for  ill-posed  problems  is  most  common.  In 
this  paper  an  iterative  projection  method  with  some  favourable  properties  is  proposed:  Convergence  can  be  established 
without  restrictions  on  the  system  matrix.  For  singular  systems  an  optimal  solution  can  be  obtained.  Finally,  it  is 
possible  to  take  advantage  of  the  shape  of  specific  system  matrices  when  calculating  the  solution. 


1.  Introduction  and  motivation 

Projection  pursuit  regression  (FRIEDMAN  and 
STUETZLE,  1981)  and  generalized  additive  models 
(HASTIE  and  TIBSHIRANI,  1990)  are  well-known 
examples  of  non-parametric  regression  problems  with 
scatterplot  smoothers.  These  approaches  require  solving 
large  linear  equation  systems.  To  reduce  the 
computational  costs  the  so-called  backfitting  algorithm 
was  introduced,  a  numerical  procedure  related  to  Jacobi 
and  Gauss-Seidel  iteration.  The  basic  idea  is  to  determine 
estimates  for  the  covariates  successively  in  a  non¬ 
parametric  manner  (scatterplot  smoother).  Backfitting 
uses  currently  available  information  from  all  covariates, 
except  the  covariate  of  which  the  estimates  are  just 
computed. This leads to a splitting of the system matrix
into d blocks, each block corresponding to one of the
predictor variables j = 1, 2, ..., d. Finally an iterative
procedure, most often Gauss-Seidel, is applied to these
blocks. Relaxation can improve the speed of convergence,
but  is  usually  not  implemented  in  statistical  software.  For 
a  discussion  of  iterative  procedures  to  solve  linear 
equation  systems  in  the  context  of  additive  regression 
modelling  see  SCHIMEK,  NEUBAUER  and  STETTNER 
(1994). 

Although there are reports that backfitting works well
(e.g. BUJA, HASTIE, and TIBSHIRANI, 1989) in most
situations, alternative procedures should be considered.
First of all, Jacobi and Gauss-Seidel iteration as well as
their variants were not developed for solving (nearly)
singular systems. In non-parametric regression linear
scatterplot smoothers such as spline and kernel
techniques are most common. The smoothed data are
design-dependent (number and location of knots,
smoothing parameter, kernel characteristics and
bandwidth), hence ill-posedness or singularity problems must
be expected and in principle the associated normal
equation system should not be solved by standard
algorithms. Further we have to be aware of concurvity,
which also contributes to the singularity of the system matrix.
As a direct consequence we cannot predict the speed of
convergence and the quality of the obtained results.

Direct, non-iterative procedures could be applied to such
singular normal equation systems. SCHIMEK,
STETTNER, and HABERL (1992) proposed a Tichonow
regularization technique. On the one hand it yields exact
solutions; on the other hand it is too expensive for
routine use. Tichonow regularization is rather a valuable
tool for the comparison of results obtained by other
numerical concepts.

In this paper we propose an alternative procedure with a
number of favourable properties. The idea is to obtain
correct solutions for large linear systems in an iterative,
cheap manner, even when the system matrix is singular.
For that purpose we take a projection-oriented,
geometrically motivated approach.


2. An iterative projection method

The  iterative  projection  method  we  want  to  develop  is 
related  to  a  row-oriented  procedure  introduced  by 
KACZMARZ  (1937)  and  a  column-oriented  technique 
due  to  de  la  GARZA  (1951,  see  also  HACKBUSCH, 
1991,  p.  203  and  HOUSEHOLDER,  1975,  section  4.2). 
We  assume  a  sequence  of  iterative  projections  which 
forms  an  "instationary  process"  in  the  terminology  of 
MAESS  (1988,  p.  116). 

Let us have a linear equation system Ax = b to be solved
in x with A an n x n matrix, x and b n-dimensional
vectors. Further we define A = (a(1), a(2), ..., a(n))
where a(l) denotes the l-th column vector of A. We
represent b step by step via a sequence of the form

$$ b = \sum_{l=1}^{k} \mu(l)\, a'(l) + u(k) \qquad (1) $$

where l = 1, 2, ..., k, k = 1, 2, ..., and

$$ a'(l) = a^*((l-1) \bmod n + 1). $$

The a*(1), a*(2), ..., a*(n) are a permutation of the
a(1), a(2), ..., a(n) to improve the convergence speed
(compare with the "cyclical criterion" described in
MURTY, 1983, p. 457). The vector u(k) represents the
"unexplained" component of b and is the perpendicular
from u(k-1) to the dimension a'(k) at iteration step k.
The coefficients $\mu(l)$ are determined in each step by an
optimality criterion

$$ f(k, \mu(k); a'(1), a'(2), \ldots, a'(k)) = 0 $$

they have to fulfil (e.g. require that u(k) is the
perpendicular of u(k-1) onto a'(k)). We can establish
conditions for the optimality criterion under which u(k)
converges towards 0.
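
For the perpendicular criterion mentioned as an example above, a short worked step shows why the unexplained component cannot grow. It assumes only the update $u(k) = u(k-1) - \mu(k)\,a'(k)$ implied by (1), with $(\cdot,\cdot)$ the usual inner product.

```latex
\begin{align*}
\mu(k) &= \frac{(u(k-1),\, a'(k))}{(a'(k),\, a'(k))}, \qquad
u(k) = u(k-1) - \mu(k)\, a'(k), \\
(u(k),\, a'(k)) &= (u(k-1),\, a'(k)) - \mu(k)\,(a'(k),\, a'(k)) = 0, \\
\|u(k)\|^2 &= \|u(k-1)\|^2 - \frac{(u(k-1),\, a'(k))^2}{(a'(k),\, a'(k))}
\;\le\; \|u(k-1)\|^2 .
\end{align*}
```

The norm of u(k) is therefore non-increasing, and it decreases strictly whenever u(k-1) is not already orthogonal to the current column.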

For the evaluation of the coefficients $\mu(l)$ we can take
advantage  of  structural  features  of  the  system  matrix  A. 
This  is  an  important  aspect  when  solving  the  normal 
equations  associated  with  additive  regression  models. 
According  to  BUJA,  HASTIE,  and  TIBSHIRANI  (1989, 
p.477)  we  have  to  solve  the  system 


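$$
\begin{pmatrix}
I      & S_1    & S_1    & \cdots & S_1    \\
S_2    & I      & S_2    & \cdots & S_2    \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
S_d    & S_d    & S_d    & \cdots & I
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{pmatrix}
=
\begin{pmatrix} S_1 y \\ S_2 y \\ \vdots \\ S_d y \end{pmatrix}
$$

in our notation Ax = b, where A and b are block
matrices of smoothing operators (matrices) $S_j$, $x_j$
solution vectors and y a dependent variable vector in an
additive regression model.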
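
To make the block structure concrete, the following sketch assembles A and b from a list of smoother matrices. It is an illustration under stated assumptions: the helper names and the two toy smoothers are hypothetical, and real scatterplot smoothers (splines, kernels) would supply the matrices $S_j$ instead.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def additive_system(smoothers, y):
    """Assemble A and b of A x = b for smoother matrices S_1, ..., S_d.

    Block (j, l) of A is the identity if j == l and S_j otherwise;
    block j of b is S_j y.
    """
    n = len(y)
    d = len(smoothers)
    A, b = [], []
    for j, S in enumerate(smoothers):
        for i in range(n):
            unit_row = [1.0 if c == i else 0.0 for c in range(n)]
            row = []
            for l in range(d):
                row.extend(unit_row if l == j else S[i])
            A.append(row)
        b.extend(matvec(S, y))
    return A, b

# Hypothetical toy smoothers for n = 3 observations (illustration only).
S1 = [[1 / 3, 1 / 3, 1 / 3]] * 3                          # global mean
S2 = [[0.5, 0.5, 0.0], [0.5, 0.5, 0.0], [0.0, 0.0, 1.0]]  # local averages
y = [1.0, 2.0, 6.0]
A, b = additive_system([S1, S2], y)
```

The resulting A is a dense (nd) x (nd) matrix; exploiting the repeated blocks rather than storing them is exactly the structural advantage referred to in the text.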

3.  Features  of  the  iterative  projection  method 

There are a number of advantages of the proposed
method:

•  It  always  converges  because  convergence  does  not 
depend  on  the  characteristics  of  the  system  matrix  A, 
such  as  diagonal  dominance. 

•  For  singular  systems  an  optimal  solution  can  be 
obtained. 

•  The  shape  of  specific  system  matrices  A  (e.g.  due  to 
certain  scatterplot  smoothers  like  cubic  smoothing 
splines)  can  be  exploited  for  the  calculation  of  the 
solution. 

One disadvantage has to be mentioned:

•  Slow  convergence  in  its  standard  version  (see  e.g. 
HACKBUSCH,  1991,  p.204  for  the  Kaczmarz 
procedure). 

To  overcome  this  weak  point  of  the  iterative  projection 
method  two  approaches  can  be  taken:  The  one  is  to 
introduce  a  projection-specific  relaxation  concept.  The 
other  is  to  resort  to  parallel  processing. 

4.  Proof  of  convergence 

We prove convergence for the optimality criterion

$$ \mu(k) = \frac{(u(k-1),\, a'(k))}{(a'(k),\, a'(k))}, \qquad a^*(k) = a(k). \qquad (2) $$

In this situation equation (1) can be written as

$$ b = \sum_{j=1}^{n} \Big[ \sum_{i \le k,\; a'(i) = a(j)} \mu(i) \Big]\, a(j) + u(k). $$

Aggregating in m(j) all $\mu(i)$ belonging to some j yields

$$ b = \sum_{j=1}^{n} m(j)\, a(j) + u(k) $$

and (2) takes the form

$$ m(j) = \frac{(b,\, a(j))}{(a(j),\, a(j))} - \sum_{l=1,\; l \ne j}^{n} m(l)\, \frac{(a(j),\, a(l))}{(a(j),\, a(j))}. \qquad (3) $$

Formula (3) can be understood as an iterative solution of
an equation system with the system matrix

$$ H = (h_{ij}) = ((a(i),\, a(j))). $$

For the convergence of this sequence we apply a classical
theorem on the convergence of iterative procedures (see
TODD, 1962, p. 222ff for details): If some matrix H is
Hermitian and positive definite then the iterative
procedure

$$ x(r+1) = d + C\, x(r), \qquad r = 0, 1, 2, \ldots $$

converges for arbitrary x(0) towards the solution of
Hx = b, where

$$ C = -L^{-1} U, \qquad U = (h_{ik},\; i < k), \qquad H = L + U, \qquad d = L^{-1} b. $$

As a direct result the approach in (1) is self-correcting
and numerically stable. Numerical errors in step k are
compensated during the computation of $\mu$ in step k+1.
These advantages are not shared with other procedures
recalculating x(k) in each step.

5. Singular equation systems

Another important advantage of the proposed method is
its convergence for singular equation systems. Let us
have b not a member of the linear space spanned by the
columns of A, and

$$ b = b_A + b_0 $$

the unique partition of b with $b_0$ orthogonal to the
column space of A. For the reason already given u(k)
converges in

$$ b_A = \sum_{j=1}^{n} m(j)\, a(j) + u(k) $$

to 0. When calculating the $\mu(i)$ successively from b
instead of $b_A$, the same coefficients are obtained, because
$b_0$ does not contribute to the solution $\mu(i)$:

$$ \mu(v+1) = \frac{(u(v) + b_0,\, a(v+1))}{(a(v+1),\, a(v+1))} = \frac{(u(v),\, a(v+1))}{(a(v+1),\, a(v+1))}. $$

Finally we have

$$ b - \sum_{i} \mu(i)\, a'(i) =: u'(k) \to b_0, $$

with $\mu(i)$ forming the solution x of min || Ax - b ||.
6.  The  algorithm 

The  algorithm  is  simple  in  its  structure  and  can  be 
expressed  as  follows. 

   read n, a(), b, a*()
   k = 0, x() = 0
   repeat
      k = k + 1
      a'(k) = a*((k-1) mod n + 1)
      solve f(k, mu(k), a'(1), ..., a'(k)) = 0
      update x()
   until terminating condition = true
   end
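
The loop can be sketched in Python as follows. This is a hedged illustration, not the authors' implementation: it assumes the perpendicular criterion (2) with the identity permutation a*(k) = a(k), nonzero columns, a fixed number of steps as the terminating condition, and an optional factor `alpha` standing for the relaxation of Section 7. The function name and the second (singular) test system are hypothetical.

```python
def dot(u, v):
    """Euclidean inner product of two vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

def iterative_projection(columns, b, steps, alpha=1.0):
    """Cyclic column projections for A x = b.

    columns : list of the (nonzero) column vectors a(1), ..., a(n) of A
    steps   : number of single projection steps k (assumed stopping rule)
    alpha   : relaxation factor; alpha = 1 gives the standard version
    Returns (x, u) where u = b - A x is the unexplained component.
    """
    n = len(columns)
    x = [0.0] * n
    u = list(b)                       # u(0) = b
    for k in range(1, steps + 1):
        j = (k - 1) % n               # column used at step k
        a = columns[j]
        mu = alpha * dot(u, a) / dot(a, a)
        x[j] += mu                    # aggregate mu(k) into m(j)
        u = [ui - mu * ai for ui, ai in zip(u, a)]
    return x, u

# Regular system of Section 7: exact solution is col(1, 1, 1).
cols = [[2.0, 2.0, 1.0], [1.0, 3.0, 1.0], [1.0, 2.0, 2.0]]
x_reg, u_reg = iterative_projection(cols, [4.0, 7.0, 4.0], steps=900)

# Hypothetical singular system: the third column is the sum of the first two.
sing = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0], [1.0, 1.0, 2.0]]
x_sing, u_sing = iterative_projection(sing, [1.0, 2.0, 0.0], steps=300)
```

In the singular case the residual `u_sing` approaches the component of b orthogonal to the column space, here (1, 1, -1), which is the behaviour described in Section 5.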

An  interesting  aspect  of  the  algorithm  is  the  possibility  to 
develop  it  into  a  parallel  processing  procedure.  The 
necessary  computer  architecture  is  characterised  by  a 
multiple  instruction  stream,  single  data  stream 
organization  (see  e.g.  KRISHNAMURTI  and 
NARAHARI,  1993,  p.  69f). 

7.  An  illustrative  example  for  standard  and 
relaxed  iterative  projection  solutions 

Let us solve the equation Ax = b, where A does not have
diagonal dominance. We assume A = [col(2,2,1),
col(1,3,1), col(1,2,2)] and b = col(4,7,4). Applying the
optimality criterion

$$ \mu(k) = \frac{(\alpha\, u(k-1),\, a'(k))}{(a'(k),\, a'(k))} $$

we obtain the standard solution for $\alpha$ = 1 and a relaxed
solution for $\alpha$ = 1.2 (larger than one to improve the speed
of convergence). Table 1 displays the approximations x
in comparison with the exact result x = col(1,1,1) for n =
50 and n = 100 iterations.

The 100 unrelaxed iterations take about twice the time of
the 50 unrelaxed iterations.
For the relaxed solution the computational costs are only
a factor 1.1 higher (reference 50 iterations) but the
precision of the obtained result is improved by a factor
of 2 to 4. Hence the relaxation technique is quite promising and
should be studied in more detail (i.e. in a simulation
experiment).
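
A short computation indicates which relaxation factors keep the residual from growing. It assumes the update $u(k) = u(k-1) - \mu(k)\,a'(k)$ with the relaxed coefficient of this section; it says nothing about which factor is fastest.

```latex
\begin{align*}
\mu(k) &= \alpha\, \frac{(u(k-1),\, a'(k))}{(a'(k),\, a'(k))}, \\
\|u(k)\|^2 &= \|u(k-1)\|^2 - 2\mu(k)\,(u(k-1),\, a'(k)) + \mu(k)^2\,(a'(k),\, a'(k)) \\
           &= \|u(k-1)\|^2 - \alpha\,(2 - \alpha)\,
              \frac{(u(k-1),\, a'(k))^2}{(a'(k),\, a'(k))} .
\end{align*}
```

Hence the norm of the unexplained component is non-increasing for $0 \le \alpha \le 2$, which covers both $\alpha = 1$ and the relaxed choice $\alpha = 1.2$ used in the example.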


Table 1. Results for standard and relaxed iterative projections

exact x    approximations x
           α = 1, n = 50    α = 1, n = 100    α = 1.2, n = 50
1          0.99981          0.99999           1.00005
1          0.99959          1.00000           1.00022
1          1.00058          0.99999           0.99974


Abstract

A problem of nonparametric curve estimation from indirect
observations is considered. An asymptotically optimal
orthogonal series estimator is suggested for the regular
setting. For the irregular setting a consistent estimator,
which is also rate optimal for some familiar cases, is
suggested as well. Particular applications are density,
filtering and nonparametric regression deconvolution and
nonparametric regression with errors in predictors.

1  Introduction 

Consider a problem of estimating a function $f: R \to R$
when only the convolution $Kf(t) = \int f(x) k(t - x)\,dx$ of $f$ with
the known function $k$ is available for direct statistical
observation.

The familiar examples are: (i) Density deconvolution,
when one estimates the density $f$ of a random variable $U$
based on n i.i.d. observations $X_1, \ldots, X_n$ having the
same distribution as that of $X$, and $X = U + \varepsilon$ where $U$
and $\varepsilon$ are independent and the probability density $k$ of the
measurement error $\varepsilon$ is given; (ii) Nonparametric blurred
image reconstruction, when one estimates $f$ using n i.i.d.
observations $\{Y_i = Kf(t_i) + \xi_i,\ i = 1, \ldots, n\}$, here $\xi$ is
the error; (iii) Nonparametric regression with errors in
predictors, when one observes n i.i.d. realizations of $(Y, X)$
where $Y = f(U) + \xi$ and $X = U + \varepsilon$.

For a regular case when the Fourier transform
$h_k(v) = \int \exp(ivt) k(t)\,dt$ does not vanish, i.e., $h_k(v) \ne 0$,
the most relevant results to our research are obtained
by Donoho and Low (1992), Fan (1991, 1993) and Fan
and Truong (1993), where rate optimal deconvolution
kernel estimates are suggested for a wide variety of
settings. In particular, for the density deconvolution
Fan (1991, 1993) suggests the following kernel estimator.
Let $K$ be a traditional kernel function and

$h_K(v) = \int \exp(ivx) K(x)\,dx$ be its Fourier transform

*This research was supported by NSF Grant DMS-9123956


with $h_K(0) = 1$. Then the deconvolution density estimator
is

$$ \hat f_n(x) = (2\pi)^{-1} \int \exp(-ivx)\, h_K(v t_n)\, \hat h_X(v)\, h_k^{-1}(v)\, dv $$

for a suitable choice of a bandwidth $t_n$, where

$$ \hat h_X(v) = n^{-1} \sum_{l=1}^{n} \exp(ivX_l) \qquad (1) $$

is the empirical characteristic function of X.
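
As a small illustration of (1), the empirical characteristic function can be computed directly from a sample. The function name and the sample values below are hypothetical and serve only to show the formula at work.

```python
import cmath

def empirical_cf(sample, v):
    """Empirical characteristic function n^{-1} * sum_l exp(i * v * X_l)."""
    return sum(cmath.exp(1j * v * x) for x in sample) / len(sample)

sample = [0.3, 1.2, 2.5, 4.0]          # hypothetical observations
value_at_zero = empirical_cf(sample, 0.0)
value_at_two = empirical_cf(sample, 2.0)
```

By construction the value at v = 0 equals one, the modulus never exceeds one, and the values at v and -v are complex conjugates.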

Fan (1991, 1993) shows that this estimate is rate optimal
(as the sample size increases) for two important classes
of distributions of $\varepsilon$. Namely, for supersmooth
distributions of order $\beta$, when the corresponding characteristic
functions $h_\varepsilon(v)$ of the noise $\varepsilon$ satisfy

$$ d_0 |v|^{\beta_0} \exp(-|v|^{\beta}/\gamma) \le |h_\varepsilon(v)| \le d_1 |v|^{\beta_1} \exp(-|v|^{\beta}/\gamma) $$

and for the ordinary smooth distributions of order $\beta$,
when the characteristic functions are not decaying
exponentially and satisfy

$$ d_0 |v|^{-\beta} \le |h_\varepsilon(v)| \le d_1 |v|^{-\beta} $$

as $v \to \infty$; here $d_0$, $d_1$, $\beta$ and $\gamma$ are some positive
constants and $\beta_0$ and $\beta_1$ are constants. The examples of
supersmooth distributions are normal, mixture normal
and Cauchy; the examples of ordinary smooth
distributions are gamma and double exponential distributions.

Similar results, which again hold for these two classes
of distributions, are known for the other settings,
including nonparametric regression with errors in predictors
(see Fan and Truong (1994)).

There  are  two  main  questions  which  will  be  addressed 
in  this  paper: 

- What is the optimal risk convergence for an arbitrary
function $k$ whose Fourier transform does not vanish, in
particular, for arbitrary $h_\varepsilon(v) \ne 0$?

- Can we suggest an optimal estimate for the irregular case
when the Fourier transform of $k$ vanishes?

To explore the problem we shall use the orthogonal
series approach. Below a short heuristic explanation of
this approach is given for the model of density
deconvolution.

Suppose that the estimated density $f(u)$ is supported over
a given finite interval, for instance $[0, 2\pi]$. Then, under
some very mild assumptions on the estimated density $f$,
it may be approximated for different loss functions with
a desired accuracy via appropriate choice of the array
$\{\lambda_j; J\}$ by the orthogonal series

$$ f(u, J, \{\lambda_j\}) = (2\pi)^{-1} \sum_{|j| \le J} \lambda_j\, h_U(j) \exp(-iju) \qquad (2) $$

where $h_U(v)$ is the characteristic function of the random
variable $U$. Here $0 \le \lambda_j \le 1$ are the smoothing
coefficients and $J$ is a cutoff.

It is well known that for the independent $U$ and $\varepsilon$ the
characteristic function of the sum $X = U + \varepsilon$ is equal to
the product of the characteristic functions of $U$ and $\varepsilon$,
that is, $h_X(v) = h_U(v) h_\varepsilon(v)$.

Thus, using the empirical characteristic function
$\hat h_X(v)$ defined in (1) as an estimate for $h_X(v)$ and
assuming that $h_\varepsilon$ does not vanish (recall that the distribution
of the noise $\varepsilon$ is known and therefore the characteristic
function is known as well) we obtain an estimate

$$ f_n(u, J, \{\lambda_j\}) = (2\pi)^{-1} \sum_{|j| \le J} \lambda_j\, \hat h_X(j)\, h_\varepsilon^{-1}(j) \exp(-iju). \qquad (3) $$

Hereafter $h^{-1}(v) = \bar h(v)/|h(v)|^2$ where $\bar h$ is the complex
conjugate of $h$.
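
The orthogonal series estimate just described can be sketched numerically. The sketch below is an illustration under loud assumptions: the measurement error is taken to be double exponential with a known scale, the data are simulated, and all function names, the cutoff J = 5 and the weights $\lambda_j = 1$ are hypothetical choices, not recommendations from the paper.

```python
import cmath
import math
import random

def empirical_cf(sample, v):
    """Empirical characteristic function n^{-1} * sum_l exp(i * v * X_l)."""
    return sum(cmath.exp(1j * v * x) for x in sample) / len(sample)

def laplace_cf(v, scale):
    """Characteristic function of a centered double exponential error."""
    return complex(1.0 / (1.0 + (scale * v) ** 2))

def inverse(h):
    """h^{-1} = conj(h) / |h|^2, as defined in the text."""
    return h.conjugate() / abs(h) ** 2

def series_coefficients(sample, J, noise_cf, weight=lambda j: 1.0):
    """Coefficients lambda_j * ecf(j) * noise_cf(j)^{-1} for |j| <= J."""
    return {j: weight(j) * empirical_cf(sample, j) * inverse(noise_cf(j))
            for j in range(-J, J + 1)}

def density_estimate(u, coeffs):
    """Evaluate (2*pi)^{-1} * sum_j c_j * exp(-i*j*u); the sum is real."""
    total = sum(c * cmath.exp(-1j * j * u) for j, c in coeffs.items())
    return (total / (2.0 * math.pi)).real

rng = random.Random(1)                 # hypothetical seed
scale = 0.3                            # assumed known error scale
# Hypothetical data: U concentrated inside [0, 2*pi] plus Laplace error
# (difference of two exponentials with mean `scale`).
sample = [rng.gauss(math.pi, 0.5)
          + rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
          for _ in range(200)]
coeffs = series_coefficients(sample, 5, lambda v: laplace_cf(v, scale))
grid = [2.0 * math.pi * m / 64 for m in range(64)]
values = [density_estimate(u, coeffs) for u in grid]
```

Because the j = 0 coefficient equals one, the estimate integrates to one over $[0, 2\pi]$ whatever the sample; it may still take small negative values, which the sketch does not correct.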

Surprisingly enough, we shall see that this estimate
may be used for the other discussed statistical models as
well, with the only difference that instead of the empirical
characteristic function we use the corresponding familiar
estimates for the case of direct observations.

Now we are in a position to explain how to solve the
deconvolution problem when $h_\varepsilon(j)$ is equal to zero for
some $j$; the familiar examples are uniform, triangle and
lattice-valued $\varepsilon$. We restrict our attention to the case
when there exist sequences $s_{jm}$, decaying as $m \to \infty$,
such that

$$ |h_\varepsilon(j + s_{jm})| > 0 \qquad (4) $$

for every integer $j$. Notice that for deconvolution of an
arbitrary density $f$ such an assumption is necessary for
consistent estimation. Thus, one can estimate $h_f(j + s_{jm})$
rather than $h_f(j)$ and then use the continuity of $h_f(j)$.
In this paper we will use this idea; a slightly different
approach, based on L'Hôpital's rule, is explored in
Efromovich (1994).

Section 2 is devoted to rate- and sharp-optimal estimation
for the regular setting. Using the modern approach,
which maps the different models into filtering in white
noise (see Brown and Low (1990)), we explore the problem
on the example of signal recovery in white noise. In
Section 3 the irregular setting is considered on the example
of density deconvolution. Some possible extensions are
discussed in Section 4.

2  Optimal  Signal  Recovery 

The considered problem is to recover a periodic signal
$f(t)$ from an observation $Y_n(t)$ such that

$$ dY_n(t) = (Kf)(t)\,dt + n^{-1/2}\,dw(t), \qquad 0 \le t \le 2\pi \qquad (5) $$

where $w(t)$ is the Brownian motion ($dw(t)$ is a so-called
white noise), $Kf$ is the given operator such that
the Fourier transform of $Kf$ satisfies $\int_0^{2\pi} \exp(ivt)(Kf)(t)\,dt =
h_f(v) h_K(v)$, where $h_f(v) = \int_0^{2\pi} f(t)\exp(ivt)\,dt$ and the
function $h_K(v)$ is given. Our problem is to estimate the
$l$-th derivative of $f$.

Let $\|f\|_p = (\int_0^{2\pi} |f(t)|^p\,dt)^{1/p}$ be the familiar $L_p$-norm
of $f$, where $1 \le p < \infty$, and $\|f\|_\infty = \mathrm{ess\,sup}_{t \in [0, 2\pi]} |f(t)|$;
let $f^{(l)}$ mean the $l$-th derivative and $\lfloor\alpha\rfloor$ be the integer
part of the positive $\alpha$.

Throughout the paper we always assume that $f$ belongs
to either the Lipschitz class $Lip(\alpha)$ of periodic
functions, when the functions are $\lfloor\alpha\rfloor$-fold continuously
differentiable and periodic on the circle $[0, 2\pi]$, $\|f\|_2 \le A < \infty$
and  <  Qlu -- for  u,v  E 

[0, 27r],  or  to  a  Sobolev  H(a,  (?)  class  of  periodic  square 
integrable  functions  such  that  the  corresponding  Fourier 
transformations  hf{v)  =  f{t)exp{ivt)dt  of  /  satisfy 
inequality  ^  ^  * 

We  shall  consider  the  Lipschitz  classes  of  functions 
when  rate  optimal  estimation  is  investigated  and  refer  to 
the  Sobolev  classes  when  sharp  optimal  Mean  Integrated 
Squared  Error  (MISE)  convergence  is  explored. 

Our assumption on periodicity of the estimated function
is not crucial for our approach but it is very convenient;
the interested reader is also referred to the discussion of
aperiodicity in Efromovich (1994).

It is well known that (5) may be rewritten as an infinite
array of discrete observations

$$ Y_j = h_f(j) h_K(j) + n^{-1/2} \xi_j, \qquad j = \ldots, -1, 0, 1, \ldots \qquad (6) $$

where $Y_j = \int_0^{2\pi} \exp(ijt)\,dY_n(t)$ and $\xi_j$ are i.i.d. standard
normal random variables.

Set 

«^(«.Q.o=  E  (7) 

bl<J» 




where  the  cutoff  Jn  is  defined  as  the  smallest  positive 
integer  such  that 

E  -l)>nQ.  (8) 

The following sequence $r_n$ plays the role of an indicator
which shows when sharp optimal estimation is possible.

Set 

r„  =  miny|<j„{f^(a,  Q,  .  (9) 

We shall see that if $r_n \to \infty$ then sharp optimal
estimation is possible and otherwise it is impossible. The
underlying idea of the sequence $r_n$ is as follows. If
$r_n < C < \infty$ then, following the terminology of Donoho
and Liu (1991) and Fan (1993), the difficulty of the
nonparametric problem may be captured by the hardest
one-dimensional subproblem. As a result there is no sharp
lower bound because there is no sharp lower bound for
a one-dimensional problem.

Define  a  real-valued  estimate 

=  (2’r)"^  E  “p(-iit). 

|j|<J 

(10) 

The following assertion shows that this estimate has
the property of optimal MISE convergence under an
appropriate choice of $J$ and $\{\lambda_j\}$. Here we assume that
Ay  =  1  —  {\j\/Jn)^  and  whenever  the  Lipschitz  space  is 
under  consideration  we  set  Q  =  1. 

Theorem  1  Let  0  <  lhic(jf)l  <  oo  and  I  <  a.  Then 
the  estimate  (10)  has  the  following  optimal  asymptotic 
(as  oo)  properties  of  MISE  convergence: 

(i)  jy rn  oo  then  estimate  fn\t)  =  fn\t,  {Ay}) 
has  sharp  optimal  minimax  MISE  convergence  over  the 
Sobolev  class  H{oc^Q)  of  functions  f,  that 

sup 

SiH(a,Q)  Jo 

=  inf  sup  Ef{f  (/P(t,  a,  Q,  Hk)  - 
feH{a,Q)  Jo 

=  (1  +  o(l))5^(a,  Q,  1) 

where  the  inf  is  over  all  possible  estimates 
fn\t,a,Q^hK)*  ^ 

(ii)  Estimate  fn\t)  =  •fn, {1})  has  rate  opti¬ 

mal  MISE  convergence  over  either  Sobolev  H{ayQ)  or 
Lipschitz  Lip[a)  classes  of  estimated  functions  and 

snp Ej{  l\mt)  -  =  0(l)i2 

Jo 


where the sup is over either the Sobolev $H(\alpha, Q)$ or
Lipschitz $Lip(\alpha)$ classes of estimated functions $f$.

We see that the smoothing coefficients $\{\lambda_j\}$ have been
employed only to obtain sharp optimal MISE convergence,
that is, the best constant and rate of MISE
convergence. They reflect the statistical nature of the
problem rather than approximation of a function via a
trigonometric polynomial.

The situation drastically changes when $L_p$-norms with
$p \ne 2$ are used to measure the accuracy of fitting.
Unfortunately, straightforward implementation of the
Fourier approximation gives the optimal fitting only
within the logarithmic factor (see more in Butzer and
Nessel (1971)). However, implementing a smoothing
allows us to avoid this decrease in accuracy of
approximation. For arbitrary $p$ the familiar de la Vallée Poussin
sum is a good alternative and the corresponding
estimate is defined as $f_n^{(l)}(t, 2J, \{\mu(j, J)\})$ with $\mu(j, J) = 1$
if $|j| \le J$, $\mu(j, J) = 2 - |j|/J$ if $J < |j| < 2J$ and

$\mu(j, J) = 0$ otherwise.
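
The weights of the de la Vallée Poussin sum are simple enough to state as a small function. The function name is hypothetical; the three cases are exactly those given above.

```python
def vallee_poussin_weight(j, J):
    """Weight mu(j, J): 1 up to the cutoff J, then linear decay to 0 at 2J."""
    a = abs(j)
    if a <= J:
        return 1.0
    if a < 2 * J:
        return 2.0 - a / J
    return 0.0

# Weights for J = 4 over the frequencies -8, ..., 8.
weights = [vallee_poussin_weight(j, 4) for j in range(-8, 9)]
```

The weights are symmetric in j, equal to one on the low frequencies and taper linearly to zero, in contrast to the abrupt cutoff of a plain partial Fourier sum.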

Note  that  this  sum  gives  an  excellent  approximation 
to  the  considered  functions  /(t).  In  fact,  de  La  Vallee 
Poussin  sums  are  within  a  constant  factor  4  of  the  best 
sup-norm  approximation  by  trigonometric  polynomials 
of a given order. See more about estimates based on
this sum in Ibragimov and Khasminskii (1981) and
Efromovich and Low (1994).

The  interested  reader  is  referred  to  Efromovich  (1994) 
where  risks  in  Lp-norms,  1  <  p  <  oo,  are  investigated.  It 
is  shown  that  the  sequence  6n  =  \/^  defines  both  sharp 
optimal  and  rate  optimal  risk  convergence  in  different 
£p-norms  for  estimates  of  /(9  whenever  1  <  p  <  oo, 
moreover,  this  is  the  optimal  rate  for  p  =  oo  as  well 
whenever  hk  corresponds  to  the  supersmooth  case. 

3 Density Deconvolution for Irregular Case

Consider  the  discussed  in  Introduction  problem  of  den¬ 
sity  deconvolution  when  K{j)  =  0  for  some  integers  j 
but  (4)  holds. 

Let  sy„  be  a  sequence  in  n  and  j  such  that  s^jn  = 
— Sjn  and  for  each  i  >  0  and  n  the  sequence  min¬ 
imizes  up  to  a  constant  factor  the  error  c(s,y,n)  = 
|s|^-|-n'"^|hg(y-|-s)|^^  over  |sl^  <  Cn"“^|/i«(y+5)|"^.  Set 
R{J,  n)  =  [n- V  Eui</  |A*(i  +  Sjn\-^  +  The 

sequence $R(J, n)$ is the upper bound (up to a constant
factor) for the risk of the recommended orthogonal series
estimate with the cutoff $J$. Then we define the optimal
sequence $J_n$ as an increasing sequence of positive integers
which minimizes the rate of decay of $R(J, n)$ as $n \to \infty$.

We are now in a position to define a consistent
real-valued orthogonal series estimate of the $l$-th derivative
of the density $f$ as

$$ \hat f_n^{(l)}(x) = (2\pi)^{-1} \sum_{|j| < 2J_n} (-ij)^l\, \mu(j, J_n)\, \hat h_X(j + s_{jn})\, h_\varepsilon^{-1}(j + s_{jn}) \exp(-ijx) \qquad (11) $$

where $\mu(j, J_n) = 1$ if $|j| \le J_n$ and $\mu(j, J_n) = 2 - |j|/J_n$
if $J_n < |j| < 2J_n$ are the smoothing coefficients of the
de la Vallée Poussin sum discussed above.

The reader who is primarily interested in a traditional
MISE may simplify the estimate and consider the simplified
estimate, which is defined by (11) with $\mu(j, J) = 1$.

Theorem 2 Suppose that (4) holds, $1 \le p \le \infty$,
$x_0 \in [0, 2\pi)$ and $l < \alpha$. Then:

(i)  Estimate  (11)  is  consistent  and 

sup  II,}  <  CJ' n) 

feLipc. 

where  J^jR(  Jn,  n)  0  a5  n  oo. 

(ii) If the distribution of $\varepsilon$ is supersmooth then the
estimate (11) is rate optimal in the sense of the lower bounds
of Fan (1991, 1993), namely,

sup  i?/{||/i')-/(')l|p}  =  0((ln(n))-(“-')//’))  , 

f€L%po, 

sup  =  0((lu(n))-=*(“-')/^)). 

feLipc 

(Hi)  Forp  =  2  statements  (i)  and  (ii)  also  hold  for  the 
simplified  estimate  fn^  and  under  assumption  of  part  (ii) 
MISE  of  this  estimate  decreases  as  0((ln(n))“^^“’"9//^)). 
for  all  f  E  Lip{a) 

Several remarks are to be made. In the first place,
contrary to the blurred image reconstruction model
of Korostelev and Tsybakov (1994), for the considered
setting irregularity does not necessarily imply
inconsistency. Secondly, for the supersmooth case there is no
influence of $p$, i.e. the loss function, on risk convergence;
recall that this is not the case for direct observations (see
Ibragimov and Khasminskii (1981)). Thirdly, a slightly
modified procedure allows one to construct a rate-optimal
procedure for the ordinary smooth case as well, see
Efromovich (1994). Finally, the procedure (11) may be
recommended for the practically important case of small
samples whenever $|h_\varepsilon(j)|$ takes on relatively small
values.

The following examples clarify the issue of
irregularity for the density deconvolution model.

Example  1.  In  this  example  we  analyze  some  familiar 
measurement  errors  which  may  lead  to  irregular  setting. 


Let $\varepsilon$ be uniformly distributed over the interval $(a, b)$;
then $h_\varepsilon(v) = \exp(iav)[\exp(i(b - a)v) - 1]/(i(b - a)v)$
(see Feller (1966)). The irregularity occurs whenever
$(b - a)j = 2\pi r$ for some integers $j$ and $r$. However,
for this familiar measurement error consistent estimation
is always possible because (4) holds. For if $h_\varepsilon(j) = 0$
then it is elementary to verify that for some positive
constants $C_1$, $C_2$ and $|s| \le (\pi/2)/(b - a)$ the relations
$C_1 |s| |j|^{-1} \le |h_\varepsilon(j + s)| \le C_2 |s| |j|^{-1}$ hold.

Recall that $s_{-jn} = -s_{jn}$. To find $s_{jn}$ for $j > 0$ let $\kappa_j$
be such that $|\kappa_j| \le \pi$ and $(b - a)j = 2\pi r + \kappa_j$ for some
integer  r.  Then  one  can  set  Sjn  =  0  if  |/Cj|  > 
and  Sjn  =  Ti"^/^|i|^^^sgn(/Cj)  otherwise.  Notice  that 
even if $\kappa_j \ne 0$, i.e. for the regular case, it may be worthwhile
to implement our method and to estimate $h_f(j + s_{jn})$
rather than $h_f(j)$.

An interesting situation occurs when $\varepsilon$ is a discrete
random variable, that is, it takes on values $a_r$ with nonzero
probability $p_r$. In particular, if $a_1 = 0$, $a_2 = \pi$ and
$p_1 = p_2 = 1/2$ then $h_\varepsilon(j) = 0$ for odd $j$.
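
The vanishing characteristic functions of Example 1 are easy to check numerically. The sketch below uses hypothetical function names and, as an assumed special case, the uniform error on $(0, 2\pi)$, for which $(b - a)j = 2\pi j$ and the characteristic function vanishes at every nonzero integer.

```python
import cmath
import math

def uniform_cf(v, a, b):
    """Characteristic function of the uniform distribution on (a, b), v != 0."""
    w = (b - a) * v
    return cmath.exp(1j * a * v) * (cmath.exp(1j * w) - 1.0) / (1j * w)

def two_point_cf(v, a1=0.0, a2=math.pi, p1=0.5):
    """Characteristic function of an error equal to a1 or a2 (prob. p1, 1 - p1)."""
    return p1 * cmath.exp(1j * a1 * v) + (1.0 - p1) * cmath.exp(1j * a2 * v)

# Irregular frequencies: the uniform error on (0, 2*pi) at integer j,
# and the two-point error {0, pi} at odd j.
zero_uniform = abs(uniform_cf(1.0, 0.0, 2.0 * math.pi))
zero_two_point = abs(two_point_cf(3.0))

# A small shift s away from the zero at j = 2 gives a modulus close to |s| / |j|.
s = 0.01
shifted = abs(uniform_cf(2.0 + s, 0.0, 2.0 * math.pi))
```

The shifted value illustrates the two-sided bound stated in Example 1: near a zero at frequency j the modulus behaves like a constant times $|s|/|j|$, so moving slightly off the integer frequency restores a nonzero denominator.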

Example 2. A wide class of “irregular” measurement
errors can be generated by mixing the random variables
described in Example 1 and traditionally studied
“regular” measurement errors whose characteristic functions
do not vanish.

Consider a mixture $\varepsilon = \varepsilon_1 + \varepsilon_2$ of two random variables
where $\varepsilon_1$ is any random variable from Example 1 with the
characteristic function $h_{\varepsilon_1}(v)$ which vanishes when $v$ is
equal to some integers and $\varepsilon_2$ is a random variable whose
characteristic function $h_{\varepsilon_2}(v)$ does not vanish. Then
irregularity always occurs because $h_\varepsilon(v) = h_{\varepsilon_1}(v) h_{\varepsilon_2}(v)$.

Such modelling of a measurement error is very convenient
for Monte Carlo simulations. For instance, to
model an irregular supersmooth setting one chooses $\varepsilon_2$
from the list of the supersmooth random variables
(recall that it includes all non-degenerate stable random
variables and their mixtures) and then adds any random
variable $\varepsilon_1$ discussed in Example 1.

Example 3. An interesting situation occurs if the random
variable X is projected onto a circle with unit radius
and we observe this projection rather than X, that is,
we observe $X^* = X - 2\pi\lfloor X/2\pi \rfloor$ instead of X. It is plain to
see that for the regular setting with $s_{jn} = 0$ this reduction
of information does not affect our estimate (2.2).

The situation changes for the irregular setting, when this
projection does affect our estimate (11). However, it is not
difficult to verify that whenever $E_f\{|\exp(is(X^* - X)) - 1|^2\} \le C|s|^2$
(for instance, the latter is the case when X
has a finite second moment), then similarly to the regular
case this reduction of information does not change the
assertion of Theorem 2.1. Moreover, the procedure of
Efromovich (1994) is not so sensitive to such projection.

Example 4. Recall that for nonparametric regression
deconvolution Korostelev and Tsybakov (1993) have
explored an example when vanishing of $h_k(j)$ implies
inconsistency. Is there a similar setting for the density
deconvolution?

To implement the underlying idea of their example we
would have to suppose that both $f$ and $k$ are periodic with period
$2\pi$ over all reals; however, there are no periodic densities,
since any density must integrate to one.

Therefore, we have to change our setting. Let the random
variable $X$ be distributed on a circle with unit radius
according to the density $g(x) = \int_0^{2\pi} f(t)k(t - x)\,dt$,
where $k(x)$ is a known periodic function and $f(x)$ is the
periodic function to be estimated. For this mathematical model
we may conclude that there is no consistent estimator
whenever $h_k(j) = 0$ for at least one $j$.

The  last  circular  model  is  a  very  special  one  and  it 
sheds  light  on  the  circumstances  when  irregularity  im¬ 
plies  inconsistency,  see  also  Hall  (1990). 

4  Extensions 

The interested reader can easily extend the obtained results
to other models. The only model that requires some
explanation is nonparametric regression with errors in
predictors.

Recall that for this model the unobserved predictor $U$
is a random variable with density $p(u) > 0$. The recommended
estimation procedure is as follows. First, we
use the estimate (3), where the empirical Fourier transform
is used instead of the empirical characteristic function.
Denote the obtained estimate by $\hat\psi_n(u)$. It is not
difficult to verify that $\hat\psi_n(u)$ is an estimate
of the ratio $f(u)/p(u)$. Thus, if $p(u)$ is known
then one can set $\hat f_n(u) = \hat\psi_n(u)p(u)$; otherwise the estimate
of $p(u)$ discussed in Section 3 can be plugged in.

An interesting possible extension is the construction of
an adaptive procedure. There are two different kinds
of adaptation. The first is to adapt to the unknown
smoothness of the underlying function $f$. The second
is a data-driven procedure for estimating $f$ when
the kernel of convolution is unknown.

5  Conclusion 

We have explored the problem of nonparametric curve
estimation for the convolution model. The proposed orthogonal
series estimator is asymptotically sharp-optimal in terms
of MISE convergence and rate-optimal in terms of risk
convergence for a wide variety of loss functions. The
procedure allows one to treat both regular and irregular cases.


An  interesting  feature  of  this  estimator  is  that  it  is 
similar  to  the  estimators  based  on  direct  observations. 
The estimator may be used for density, filtering and
nonparametric regression deconvolution as well as for
nonparametric regression with errors in predictors.

Survey sampling, or finite population estimation, has been
a  domain  unto  itself,  with  theory  and  methods  distinct  from 
mainstream  statistics.  Nonparametric  regression  applied  to 
survey  results  can  yield  more  efficient  estimation  of 
population  quantities  than  standard  methods  of  survey 
inference,  and,  conceptually,  bridges  the  gap  between 
traditional  mainstream  statistics  and  survey  sampling.  The 
application  of  nonparametric  regression  to  finite  population 
estimation  raises  new  questions  for  survey  sampling  and  for 
the field of nonparametric regression.

1.  INTRODUCTION 

In this paper, we consider the application of nonparametric
regression to the estimation of finite population
"parameters" based on a sample from the population.
Given a population $P$ of $N$ units, for each of which there is a
variable $Y$ of interest, with values available on a sample $s$
of $P$, we wish to estimate the population total $T = \sum_{i \in P} Y_i$.

We assume that an auxiliary variable $x$ related to $Y$ is
available for the entire population.

Typically (although not always) the sample is
selected according to a probability design, and the
probabilities that an item is included in the sample are
incorporated into the estimator. For example, stratifying on
the auxiliary variable and using stratified random sampling without
replacement leads to the "expansion estimator"
$\hat T_\pi = \sum_{h=1}^{H}\sum_{i \in s_h} Y_i/\pi_{hi}$, where
$\pi_{hi} = n_h/N_h$, $h = 1, 2, \ldots, H$, are the
probabilities of including unit $i$ in the sample component
$s_h$ of the $h$th stratum.
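
As a concrete illustration, the expansion estimator can be sketched in a few lines. This is a minimal sketch assuming stratified simple random sampling without replacement, so that every unit in stratum $h$ has inclusion probability $n_h/N_h$; the function name `expansion_estimator` and the input layout are hypothetical.

```python
def expansion_estimator(strata):
    """Expansion estimate of a population total.

    strata: list of (N_h, sample_values) pairs, one per stratum, where N_h is
    the stratum population size and sample_values are the sampled Y values.
    """
    total = 0.0
    for N_h, ys in strata:
        n_h = len(ys)
        pi_h = n_h / N_h  # inclusion probability within this stratum
        total += sum(y / pi_h for y in ys)  # each sampled unit stands for 1/pi_h units
    return total


# Hypothetical toy data: two strata of sizes 10 and 4, two sampled units each.
example = [(10, [1.0, 3.0]), (4, [5.0, 5.0])]
print(expansion_estimator(example))
```

Each sampled value is inflated by the inverse of its inclusion probability, which is the only way the design enters the estimator.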

Here  we  suggest  a  new  estimator  which  uses 
nonparametric  regression  and  is  based  on  the  prediction 
approach  to  survey  inference  (for  example,  see  Royall  and 
Herson, 1973). Related work in the application of
nonparametric  regression  to  sampling  may  be  found  in 
(Cheng  1994,  Dorfman  and  Hall  1992,  Jones  and  Bradbury 
1993,  Kuk  1993). 

2.  A  NEW  ESTIMATOR  OF  TOTAL 
Consider  the  model 

$$Y_i = m(x_i) + \sigma(x_i)\varepsilon_i, \quad i = 1, \ldots, N \tag{1}$$

with $m(\cdot)$ a smooth function and the $\varepsilon_i$ independent with
mean 0 and constant variance. Let $K(u)$ be a symmetric
density function, for example the standard normal density
function. For a chosen scaling factor ("bandwidth") $b$,
define $K_b(u) = b^{-1}K(u/b)$, and weights
$w_i(x) = K_b(x_i - x)\big/\sum_{i' \in s} K_b(x_{i'} - x)$. We consider the
Nadaraya-Watson estimator of $m(x)$ given by

$$\hat m(x) = \sum_{i \in s} w_i(x)Y_i. \tag{2}$$

Under reasonable conditions on $m(x)$ and the design
points $x_i$, $\hat m(x)$ is consistent for $m(x)$ as
$b \to 0$, $nb \to \infty$.
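
The Nadaraya-Watson estimator is a kernel-weighted average of the sampled responses, which the following sketch makes explicit. It is a minimal illustration assuming a standard normal kernel; the function names are hypothetical and no attempt is made to guard against a zero denominator when every sample point is far from `x`.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def nadaraya_watson(x, xs, ys, b, kernel=gaussian_kernel):
    """Kernel-weighted average of the sample responses ys at the point x."""
    k = [kernel((xi - x) / b) / b for xi in xs]  # K_b(x_i - x)
    denom = sum(k)  # normalises the weights so that they sum to one
    return sum(ki * yi for ki, yi in zip(k, ys)) / denom


# Two sample points placed symmetrically about 0 receive equal weight there.
print(nadaraya_watson(0.0, [-1.0, 1.0], [0.0, 2.0], b=0.5))
```

Because the weights sum to one, a constant response is reproduced exactly whatever the bandwidth.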

If we let $x_j$, $j \in P - s$, denote the values of $x$ in the part of the
population which has not been sampled, then a natural
estimate of $T$ is

$$\hat T_{np} = \sum_{i \in s} Y_i + \sum_{j \in P - s} \hat m(x_j).$$

As with prediction-based estimators generally, this
estimator ignores sampling probabilities.
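
The prediction idea can be sketched as follows: the observed responses are summed as they are, and each non-sampled unit contributes its kernel-regression prediction. This is an illustrative sketch under stated assumptions (a standard normal kernel, hypothetical function names), not the authors' implementation.

```python
import math


def nw_predict(x, xs, ys, b):
    """Nadaraya-Watson prediction at x with a standard normal kernel."""
    k = [math.exp(-0.5 * ((xi - x) / b) ** 2) for xi in xs]  # constants cancel in the ratio
    return sum(ki * yi for ki, yi in zip(k, ys)) / sum(k)


def np_total(sample_x, sample_y, nonsample_x, b):
    """Sum of observed responses plus predictions for the non-sampled units."""
    predicted = sum(nw_predict(xj, sample_x, sample_y, b) for xj in nonsample_x)
    return sum(sample_y) + predicted


# Hypothetical toy population: three sampled units and two non-sampled units.
print(np_total([1.0, 2.0, 3.0], [2.0, 2.0, 2.0], [1.5, 2.5], b=0.5))
```

Note that the auxiliary values of the non-sampled units are required, but no inclusion probabilities appear anywhere in the computation.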

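The conditional mean and variance of $\hat T_{np} - T$
under (1) are readily expressed as

$$E(\hat T_{np} - T \mid X_P) = \sum_{j \in P-s} \hat d_s(x_j)^{-1}(nb)^{-1}\sum_{i \in s} K\{(x_i - x_j)/b\}\{m(x_i) - m(x_j)\} \tag{3}$$

and

$$\mathrm{var}(\hat T_{np} - T \mid X_P) = \sum_{i \in s} w_i^2\sigma^2(x_i) + \sum_{j \in P-s} \sigma^2(x_j), \tag{4}$$

where $w_i = \sum_{j \in P-s} w_i(x_j)$,
$\hat d_s(x_j) = (nb)^{-1}\sum_{i \in s} K\{(x_i - x_j)/b\}$, and $X_P$ is
the population vector of $x$-values. Note that $\hat d_s(x_j)$ is the
standard kernel estimator of the sample density $d_s(x_j)$. We have
the following theorem along lines suggested in (Chambers,
Dorfman, and Hall 1992), (Dorfman and Hall 1992), and
(Ruppert and Wand 1993):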

Theorem 1. Let $K(u)$ be a symmetric density function
with $\int uK(u)\,du = 0$ and $k_2 = \int u^2K(u)\,du > 0$; assume $n$
and $N$ increase together such that $n/N \to \pi$, with
$0 < \pi < 1$; assume sample and non-sample values of $x$ are
in the interval $[c,d]$ and are generated by densities $d_s$ and
$d_{p-s}$ respectively, both bounded away from zero on $[c,d]$,
and assumed to have continuous first derivatives; let
$\hat m(x)$ be defined as at (2) above; assume $m(x)$ has a
continuous second derivative, and let
$\beta(x) = d_s(x)m''(x) + 2d_s'(x)m'(x)$; then

$$E(\hat T_{np} - T \mid X_P) = b^2(N-n)(k_2/2)\int \beta(x)d_s(x)^{-1}d_{p-s}(x)\,dx + O_p(nb^3 + n^{1/2}b^{1/2}) \tag{5}$$

and

$$\begin{aligned}
\mathrm{var}(\hat T_{np} - T \mid X_P) ={}& (N-n)^2n^{-1}\int \sigma^2(x)d_s(x)^{-1}\{d_{p-s}(x)\}^2\,dx \\
&+ (N-n)n^{-1}b^{-1}\int K^2(u)\,du \int \sigma^2(x)d_s(x)^{-1}d_{p-s}(x)\,dx \\
&+ (N-n)^2n^{-1}b^2k_2\int c^*(x)d_s(x)\,dx \\
&+ (N-n)\int \sigma^2(x)d_{p-s}(x)\,dx + O_p(nb^3 + n^{1/2}b^{-1/2}).
\end{aligned} \tag{6}$$

We leave unspecified $c^*(x)$, a complicated function of the
derivatives of $d_s(x)$ and $d_{p-s}(x)$.

Proof of Theorem 1.
The basic mechanism driving the proofs is that if, for any
expression $Z$, $E(Z \mid u) = A(u) + O(B)$ and $\mathrm{var}(Z \mid u) = O(C)$, then
$Z = A(u) + O_p(B + C^{1/2})$, a result that follows from the
Chebychev inequality. In the following remarks, $i$, $i'$ etc.
index sample units, and $j$, $j'$ nonsample units.

Transition from (3) to (5): Note that
$\hat d_s(x_j) = d_s(x_j) + b^2d_s''(x_j)k_2/2 + O_p(b^3 + n^{-1/2}b^{-1/2})$, a result
that follows directly from calculation of the mean and
variance of $\hat d_s(x_j)$. In similar fashion, conditional on $x_j$,

$$(nb)^{-1}\sum_{i \in s} K([x_i - x_j]/b)\{m(x_i) - m(x_j)\} = b^2\beta(x_j)k_2/2 + O_p(b^3 + n^{-1/2}b^{1/2}),$$

since the left-hand expression has mean $b^2\beta(x_j)k_2/2 + O(b^3)$ and variance

$$n^{-1}b^{-2}\{E(K^2([x_i - x_j]/b)[m(x_i) - m(x_j)]^2 \mid x_j) - E^2(K([x_i - x_j]/b)[m(x_i) - m(x_j)] \mid x_j)\} = O(n^{-1}b);$$

the last equality follows from

$$\begin{aligned}
E(K^2([x_i - x_j]/b)[m(x_i) - m(x_j)]^2 \mid x_j) &= \int K^2([w - x_j]/b)[m(w) - m(x_j)]^2 d_s(w)\,dw \\
&= b\int K^2(u)\{ub\,m'(x_j) + O(u^2b^2)\}^2\{d_s(x_j) + ub\,d_s'(x_j) + O(u^2b^2)\}\,du.
\end{aligned}$$

Combining the $(N-n)$ terms and repeating the argument leads to (5).

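Transition from (4) to (6): Let $M = N - n$. The second term
of (4) is straightforward to deal with. The main task is
developing an expression for $w_i^2$. We have

$$\begin{aligned}
E(w_i^2 \mid x_i) ={}& Mn^{-2}b^{-2}E\{K^2([x_i - x_j]/b)d_s^{-2}(x_j)(1 + c(x_j)b^2 + O_p(b^3 + n^{-1/2}b^{-1/2}))^2 \mid x_i\} \\
&+ M(M-1)n^{-2}b^{-2}E^2\{K([x_i - x_j]/b)d_s^{-1}(x_j)(1 + c(x_j)b^2 + O_p(b^3 + n^{-1/2}b^{-1/2})) \mid x_i\}.
\end{aligned}$$

By the usual Taylor expansion, the first term equals
$Mn^{-2}b^{-1}\int K^2(u)\,du\,d_s^{-2}(x_i)d_{p-s}(x_i) + O(n^{-2} + n^{-3/2}b^{-3/2})$;
the second term equals
$M^2n^{-2}\{d_s^{-2}(x_i)d_{p-s}^2(x_i) + k_2c^*(x_i)b^2\} + O(n^{-1} + b^3 + n^{-1/2}b^{-1/2})$.
Further, in $\mathrm{var}(w_i^2 \mid x_i) = E(w_i^4 \mid x_i) - E^2(w_i^2 \mid x_i)$, the
dominant terms of these two terms cancel, and we find
$\mathrm{var}(w_i^2 \mid x_i) = O(n^{-1}b^{-1})$. Combining expressions
yields

$$w_i^2 = M^2n^{-2}\{d_s^{-2}(x_i)d_{p-s}^2(x_i) + k_2c^*(x_i)b^2\} + Mn^{-2}b^{-1}\int K^2(u)\,du\,d_s^{-2}(x_i)d_{p-s}(x_i) + O_p(n^{-1} + b^3 + n^{-1/2}b^{-1/2}).$$

Summing over $i$ yields (6). ♦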
We note the following consequences:

(i) The conditional relative bias is $O_p(b^2)$; this
goes to zero so long as $b \to 0$.

(ii) The variance is $O_p(n)$ so long as the weak conditions
$b \to 0$, $nb \to \infty$ are met.

(iii) If $b = Cn^{\varepsilon}$ for $\varepsilon < -1/4$, then the ratio
$E(\hat T_{np} - T \mid X_P)/\mathrm{var}^{1/2}(\hat T_{np} - T \mid X_P)$ is asymptotically zero
in probability, a next-best-to-unbiasedness condition that
allows for constructing confidence intervals for $T$ based on
estimates of variance; we note that the standard bandwidth
$b = Cn^{-1/5}$, optimal under mean square error criteria for
$\hat m(x)$ itself, is too large for the bias to become negligible, so
that other than standard methods of selecting bandwidth
seem to be in order.

(iv) Under simple random sampling,
or more generally when $d_s(x) = d_{p-s}(x)$, the variances of
the simple random sample based $\hat T_\pi$ and $\hat T_{np}$ are equal (to
first order), but the bias of $\hat T_\pi$ is of the same order
$O_p(n^{1/2})$ as the root of the variance, unless $m(x)$ is
constant on $[c,d]$; hence $\hat T_\pi$ lacks the desirable property
mentioned in the previous remark. In the case of stratified
random sampling, where a finite number of strata grow
without limit, the bias of $\hat T_\pi$ is likewise in general
$O_p(n^{1/2})$ unless $m(x)$ is constant on each stratum.

(v) The results on the bias of $\hat T_{np}$ hold whether or not the
sample and non-sample densities are the same; this
suggests that balance (Royall and Herson 1973) plays a less
important role with this estimator. However, we cannot be
indifferent to the spread of the $x$'s, since the efficiency of
nonparametric regression can be affected; compare (Chu
and Marron 1991).

(vi) In the variance the implicit term $O_p(n^{1/2}b^{-1/2})$
is of larger order than the explicit $O(b^{-1})$
term for $b = Cn^{\varepsilon}$, $\varepsilon > -1$; this suggests that plug-in
methods for estimating bandwidth based on (4) and (6)
will be ineffective.

(vii) The condition of the theorem that
$n$ is of the same order as $N$ can be loosened to $n = O(N)$, at
the price of complicating the expression of the $O_p(\cdot)$
terms in (4) and (6).
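
A back-of-the-envelope order calculation, offered as a sketch rather than as part of the original argument, makes the bandwidth condition of remark (iii) concrete. It assumes only the leading orders implied by the theorem, namely a conditional bias of order $nb^2$ and a conditional variance of order $n$:

```latex
\frac{E(\hat T_{np} - T \mid X_P)}{\mathrm{var}^{1/2}(\hat T_{np} - T \mid X_P)}
  \asymp \frac{n b^{2}}{n^{1/2}}
  = n^{1/2} b^{2}
  = C^{2}\, n^{1/2 + 2\varepsilon}
  \qquad \text{for } b = C n^{\varepsilon}.
```

The exponent $1/2 + 2\varepsilon$ is negative exactly when $\varepsilon < -1/4$, so the ratio vanishes in that range. For the standard choice $\varepsilon = -1/5$ the exponent is $1/10$, and the ratio grows slowly with $n$ instead of vanishing.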

To the end of estimating variance, we follow a
suggestion of Rose (1978) and define

$$\hat\sigma^2(x) = \hat m_2(x) - \hat m_h^2(x), \tag{7}$$

where $\hat m_h(x)$ is a pilot estimator of $m(x)$ based on
bandwidth $h$, as in (2), and
$\hat m_2(x) = (nl)^{-1}\sum_{i \in s} K([x_i - x]/l)Y_i^2/\hat d_s(x)$ is a
nonparametric regression estimator of $m_2(x) = E(Y^2 \mid x)$ based
on a possibly different bandwidth $l$. We allow $h$ and $l$ to be
different.

Theorem  2.  Let 

var(f^  -  T\Xf.)  = 

(jr)  as  defined  in  (7).  Then  var(7^p  —  T\X^- 

var{f„,-nx,)= 

+T^nh)~^'^  +  /i^]  + 1) 

3.  EMPIRICAL  RESULTS 

We consider a population consisting of $N = 400$
establishments. The data are taken from the United States
Bureau of Labor Statistics' 1991 Occupational
Compensation Survey. The variable of interest $Y$ is the
total wages paid to workers in a selected group of
occupations; $x$ is the total number of workers in each
establishment, including those in occupations outside the
selected group. From this population, 100 samples were
taken, using stratified random sampling without
replacement; for $h = 1, 2, 3$, $n_h = 20$ points were taken from
each of three strata of sizes $N_h = 202$, 114, and 84
respectively. Three classes of company size, viz.
$0 < x < 250$, $250 < x < 1000$, and $1000 < x$, determined
the strata.

For each sample, we calculated (i) the
nonparametric regression-based estimator; (ii) several
design-based estimators of the total, namely the expansion
estimator, the combined and separate ratio and
regression estimators (Cochran 1977), and poststratified
estimators; and (iii) the linear-model based estimator with
different assumed variance structures. The auxiliary
variable was log-transformed for the nonparametric
regression-based estimator, and for the design-based
estimators in some instances. Three bandwidths, judged
to give reasonable results based on visual inspection of
fits on a single sample, were used in two ways,
namely, for immediate use in the nonparametric estimator
of total $\hat T_{np}$, or as seeds to choose the bandwidth by the
algorithm of Härdle, Hall, and Marron (1992).

Table 1 gives summary results in the form of the
average relative error $\sum_{r=1}^{100}\{(\hat T_r - T)/T\}/100$ and the root
average squared error $\{\sum_{r=1}^{100}(\hat T_r - T)^2/100\}^{1/2}$
(RASE), where $\hat T_r$
is one of the estimators of $T$ computed for sample $r$. In
terms of the RASE, we note that the combined and separate
regression estimators do not much improve the expansion
estimator, and in fact do a lot worse unless the auxiliary
variable is log-transformed. The nonparametric regression-based
estimator is more efficient (i.e. has smaller average
squared error) than the best of the design-based estimators
at the two larger bandwidths. It has about the same
efficiency as the expansion estimator at the smaller
bandwidth. The bandwidth selection procedure does not do
as well as the naked eye.
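
The two summary measures are simple to compute from the replicated estimates. The sketch below is illustrative only: the function names are hypothetical, and it assumes that the relative error of a replicate is $(\hat T_r - T)/T$.

```python
import math


def average_relative_error(estimates, true_total):
    """Mean of (T_hat_r - T) / T over the replicated samples."""
    return sum((t - true_total) / true_total for t in estimates) / len(estimates)


def rase(estimates, true_total):
    """Root average squared error of the replicated estimates."""
    mean_sq = sum((t - true_total) ** 2 for t in estimates) / len(estimates)
    return math.sqrt(mean_sq)


# Hypothetical replicates scattered around a true total of 10.
reps = [9.0, 11.0, 10.5, 9.5]
print(average_relative_error(reps, 10.0), rase(reps, 10.0))
```

An estimator can have an average relative error near zero and still a large RASE, which is why both columns are reported.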

Greatest efficiency was achieved by the model-based
estimator relying on a linear model, with variance
assumed proportional to $x^2$, but with the other variance
structures efficiency drops well below that of the
nonparametric estimator at the larger bandwidths. Note that the
nonparametric regression estimator does not require us to
specify the variance structure.

Table 2 gives results on
variance estimation for the expansion, poststratified, and
nonparametric regression estimators. The mean root of the
variance estimates tends to be lower than the RASE,
especially for the nonparametric regression estimator, and
coverage is low. For the stratification estimator with finest
stratification, the variance estimator was available in only
82% of runs. We note the anomalous behavior of the
nonparametric regression variance estimator, which tends
to get smaller at smaller bandwidths, when the RASE is
largest. Allowing $l$, the bandwidth used to estimate the
$\hat m_2(x)$ component of $\hat\sigma^2(x)$, to be chosen independently
worked to moderate advantage here, somewhat increasing
the average root variance estimates and the coverage. In
one run for seed($h$) = 0.25, the variance estimate was
negative. In practice, one could have recourse to forcing
$l = h$ in such a case.

4.  QUESTIONS 

The  empirical  results  suggest  that  the  nonparametric 
regression-based  estimator  of  a  finite  population  total  is  a 
strong  rival  to  established  estimators.  It  has  the  quality  of 
automaticity  we  associate  with  design-based  estimators,  but 
is  likely  to  reflect  better  the  actual  structure  of  the  data, 
yielding  greater  efficiency.  It  can  be  costly  in  computer 
power,  and  may  not  do  as  well  as  a  parametric-model  based 
estimator,  when  the  modelling  process  is  done  carefully  on 
well-behaved  data. 

Further  research  on  the  nonparametric  regression- 
based  estimator  is  needed: 

* Automatic bandwidth selection. Standard bandwidth
selection methods such as that of Härdle, Hall, and Marron
(1992), which we used in the simulation study, aim at
estimating a bandwidth that minimizes the average squared
error of the $\hat m(x_i)$, $i \in s$. This bandwidth has the
property that $h = Cn^{-1/5}$, outside the range of acceptable
bandwidths in note (iii) of Section 2. It is in fact larger, so
that curves based on standard methods will tend to be too
smooth for deriving the estimate of total. One is tempted to
use plug-in methods that would minimize the MSE of $\hat T_{np}$,
but as in note (vi) of Section 2, there seem to be intrinsic
barriers to this approach.

* Variance estimation. The results of the simulation
suggest a need for further work here that would give better
coverage, and estimated standard deviations with
expectation closer to the root mean square error of $\hat T_{np}$.

* Can we improve by alternatives to straightforward
Nadaraya-Watson? Many methods deserve serious
consideration, including adaptive bandwidth and local
linear regression (Fan and Gijbels 1992). Possibly
estimates or a priori guesses of the variance structure could
profitably be incorporated into the nonparametric
regression based estimator.

* As noted in Section 2 note (v), the sample design is much
less of a concern when we use nonparametric regression
than in strictly model-based estimators. But because of the
dangers of extrapolation and "internal extrapolation"
(dealing with holes), this cannot mean that any sample is
permissible; what are the boundaries of the permissible
and also what characterizes samples with smallest MSE?

Table 1. Summary Statistics for Estimators of Total in Wage Population

Estimator                      Average          Root Average Squared          RASE($\hat T$)/
                               Relative Error   Error/10^[exponent illegible] RASE($\hat T_\pi$)

stratified expansion            0.035^a          6.34^b (.42)                  1.00
poststratified
    6 strata                    0.033            6.36 (.51)                    1.00
    9 strata                    0.023            6.07 (.53)                    0.96
    12 strata                   0.15^d           6.43^d (.56)                  1.01
combined ratio                  0.040            6.22 (.56)                    0.98
separate ratio                  0.042            6.32 (.58)                    1.00
combined regression             0.070            7.56 (.79)                    1.19
combined regression (log)       0.033            6.16 (.40)                    0.97
separate regression             0.069            7.71 (.80)                    1.22
separate regression (log)       0.032            6.33 (.56)                    1.00
linear model
    $\sigma^2(x_i) \propto$ [illegible]      0.102            6.72 (.39)                    1.06
    [label illegible]           0.067            6.94 (.62)                    1.10
    $\sigma^2(x_i) \propto x_i^2$            -0.063           4.56^c (.33)                  0.72
nonparametric regression
    b = 0.25                    0.040            6.50 (.59)                    1.02
    b = 0.50                    0.013            5.67^c (.42)                  0.89
    b = 0.75                    0.001            5.40^c (.38)                  0.85
    seed(b) = 0.25              0.042            6.58 (.60)                    1.04
    seed(b) = 0.50              0.025            6.13 (.54)                    0.96
    seed(b) = 0.75              0.018            5.90 (.50)                    0.93

^a Standard deviation for all entries is approximately 0.02.
^b Standard deviation is given in parentheses.
^c The paired two sample t-test on the hypothesis $E(|\hat T_\pi - T| - |\hat T - T|) = 0$ is significant at $p = 0.05$.
^d Based on [number illegible] runs.

Table 2. Summary Statistics for Estimators of Variance of Total in Wage Population

Estimator                  Root Average Squared            Average root   Coverage        Average root   Coverage
                           Error/10^[exponent illegible]   var. estimate  (95% nominal)   var. estimate  (95% nominal)

stratified expansion        6.34                            6.63           91
poststratified
    6 strata                6.35                            6.29           92
    9 strata                6.07                            6.20^a         88^a
    12 strata               6.42^c                          5.95^c         88^c

nonparametric regression                                    l = h = b                      h = b, seed(l) = b
    b = 0.25                6.50                            5.36           87              5.46           88
    b = 0.50                5.67                            5.67           89              5.90           91
    b = 0.75                5.40                            6.20           93              6.33           94

                                                            l = h = b                      h = b, seed(l) = seed(b)
    seed(b) = 0.25          6.58                            5.59           86              5.61^d         88
    seed(b) = 0.50          6.14                            5.71           88              5.75           88
    seed(b) = 0.75          5.90                            5.91           90              6.03           91

^a Based on 97 runs.
^b Based on 95 runs.
^c Based on 82 runs.
^d Based on 99 runs.

Abstract

Robust estimators are generally computationally intensive.
A method which we call the inner products method
(IP) is examined which is computationally comparable to
Least Squares (LS) but is robust with respect to outliers.
The algorithm consists of multiplying both the original
model function and the interpolated data points by a set
of “test functions” $\phi_1, \phi_2, \ldots, \phi_n$ and then integrating
each to form the inner products. The two results are set
equal to one another, yielding a system of constraints on
the unknown parameters. This system of constraints is
then used to solve for the unknown parameters. For linear
models, the algorithm requires no initial estimate of
the parameters, and the equations generated are always
linear. With only Gaussian noise, least squares (LS) does
slightly better than IP. When outliers are introduced, IP
does significantly better than LS and comparably with
other robust methods such as Least Median of Squares
(LMS) and Iteratively Reweighted Least Squares (IRLS).
As with LMS, it can be used to predict model errors.
Data from the literature, as well as simulated data, have
been used to evaluate IP’s performance. The algorithm
generalizes to, and has been applied to, certain nonlinear
models.

Introduction 

Modeling  of  data  arises  in  numerous  applications  in  the 
natural  and  social  sciences.  Regression  analysis,  perhaps 
the  most  commonly  used  statistical  technique,  is  used  to 
fit  observed  data  to  a  theoretical  model  function.  While 
least  squares  methods  are  usually  employed,  they  break 
down in the presence of non-Gaussian noise. Data sets
often  contain  non-normally  distributed  noise  including 
one  or  more  wild  observations  (Rousseeuw  1987,  Clancy 
1947,  Phillips  1983).  Several  robust  methods  have  been 
suggested  (Rousseeuw  1987,  Huber  1981,  Hampel  1986), 
but  the  algorithms  are  generally  computationally  inten¬ 
sive  and  often  require  initial  guesses  for  the  parameters, 
even  in  the  linear  case. 


Background 

Consider a set of $n$ data points, $(x_i, y_i)$, which are to be
fitted to a model,

$$y(x) = f(x; a_1, \ldots, a_m), \tag{1}$$

where the $a_j$, $j = 1, \ldots, m$, are unknown parameters.
The method of least squares (LS) involves minimizing
the function

$$\Phi(a_1, \ldots, a_m) = \sum_{i=1}^{n}[f(x_i; a_1, \ldots, a_m) - y_i]^2. \tag{2}$$

The least-squares estimator is a maximum likelihood
estimator of the parameters if the errors in the data
are independent and normally distributed with a constant
standard deviation (Brownlee 1960). When the
set of data points is thought to fit a linear combination
of more than one function, one may use the generalized
least squares (GLS) method. The least-squares method
has been extended to the nonlinear setting using the
Levenberg-Marquardt method (Bates 1988, Marquardt
1963, Moré 1977).

For  linear  problems  several  robust  alternatives  to  LS 
estimation have been proposed which reduce the influence
of outliers (Rousseeuw 1987, Hampel et al. 1986).
Among  them  are  M-estimators  (for  a  survey  see  Huber 
1981).  These  estimates  yield  a  system  of  equations  which 
is  typically  nonlinear  and  difficult  to  solve.  Iteratively 
reweighted  least  squares  (IRLS)  methods  are  then  used. 
For  a  review  see  Holland  1977  and  O’Leary  1990. 

Rather  than  minimizing  the  sum  of  a  function  of  the 
residuals,  the  least  median  of  squares  (LMS)  method 
minimizes  the  median  of  the  squares  of  the  residuals 
(Rousseeuw  1987).  LMS  is  robust  with  respect  to  out¬ 
liers  and  leverage  points  but  has  a  slow  convergence  rate. 

Inner  Product  Method  (IP) 

Let $E(x)$ be an experimental data function whose domain
is a finite set of points in the interval $[p, q]$. Let
$M(x) = M(x; a_1, a_2, \ldots, a_m)$ be the model function with
unknown parameters $a_1, a_2, \ldots, a_m$, where $x$ is defined in
$[p, q]$. Ideally,

$$M(x) = E(x)$$

where both are defined. Thus we expect

$$\int_p^q M(x)\,dx = \int_p^q E(x)\,dx$$

where the data points are linearly interpolated. More
generally, if $\phi_1, \ldots, \phi_m$ are arbitrary integrable functions
on the interval $[p, q]$ (called test functions), we expect

$$\int_p^q M(x)\phi_i(x)\,dx = \int_p^q E(x)\phi_i(x)\,dx, \quad i = 1, 2, \ldots, m. \tag{3}$$

Choosing good test functions, and integrating both sides
of (3), we obtain a system with $m$ equations and $m$
unknowns.

We now solve for the unknowns to obtain the desired
parameters. The motivation for this method is that the
effect of the “random” noise in the data will be minimized
after integrating against an integrable test function.
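
The construction can be sketched for the simplest case, a straight-line model $M(x) = a + bx$ with the two test functions $1$ and $x$. This is a minimal illustration, not the authors' implementation: the function names and the choice of test functions are assumptions, and Simpson's rule is used per segment because it integrates the product of the linear interpolant with these test functions exactly.

```python
def integrate_interpolant(xs, ys, phi):
    """Integrate E(x) * phi(x), where E linearly interpolates the data (xs, ys)."""
    total = 0.0
    for x0, y0, x1, y1 in zip(xs, ys, xs[1:], ys[1:]):
        xm, ym = 0.5 * (x0 + x1), 0.5 * (y0 + y1)  # midpoint of the segment
        # Simpson's rule on one segment (exact for integrands up to degree 3).
        total += (x1 - x0) / 6.0 * (y0 * phi(x0) + 4.0 * ym * phi(xm) + y1 * phi(x1))
    return total


def ip_line_fit(xs, ys):
    """Fit M(x) = a + b*x by matching inner products with phi_1 = 1 and phi_2 = x."""
    p, q = xs[0], xs[-1]
    # Integrals of 1, x and x^2 over [p, q] give the model side in closed form.
    m0 = q - p
    m1 = (q ** 2 - p ** 2) / 2.0
    m2 = (q ** 3 - p ** 3) / 3.0
    e1 = integrate_interpolant(xs, ys, lambda x: 1.0)
    e2 = integrate_interpolant(xs, ys, lambda x: x)
    # Solve a*m0 + b*m1 = e1 and a*m1 + b*m2 = e2 by Cramer's rule.
    det = m0 * m2 - m1 * m1
    a = (e1 * m2 - m1 * e2) / det
    b = (m0 * e2 - m1 * e1) / det
    return a, b


# Noise-free data on the line y = 2 + 3x at unequally spaced points.
print(ip_line_fit([0.0, 1.0, 2.0, 4.0], [2.0, 5.0, 8.0, 14.0]))
```

The resulting system is linear in the parameters, so no starting values are needed; the xs are assumed sorted with at least two distinct points.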

This algorithm, the inner product method (IP) (Sturm
1992), differs from other robust methods. To show this
we define the following “universal” estimator system ($U$),
where $\theta$ is the vector of unknown parameters and
$x_i = (x_{i1}, x_{i2}, \ldots, x_{im})$:

$$\sum_{i=1}^{n} \phi_k(x_i)\,\psi\Big(y_i - \sum_{j} x_{ij}\theta_j\Big) = 0. \tag{4}$$

Here $\psi$, $\phi_k$ are real valued functions of one variable
with the property $\psi(0) = 0$. This last condition is imposed
in order that the true values for the parameters
are solutions to (4) when the data is perfect.

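In matrix form for IP, (4) becomes $A\theta = Y$, where $A$ is the
matrix with entries $A_{kj} = \sum_i x_{ij}\phi_k(x_i)$, $\theta$ is the column
vector of the parameters $\theta_j$, and $Y$ is the column vector with
entries $Y_k = \sum_i y_i\phi_k(x_i)$.
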

Note that LS for simple regression is obtained from $U$
by specializing as follows: $\psi(x) = x$, $\phi_1(x) = x$, $\phi_2(x) = 1$.
The IRLS method and the M-estimator method for
simple regression are obtained from $U$ by specializing as
follows: $\phi_1(x) = x$ and $\phi_2(x) = 1$ and allowing $\psi$ to be
arbitrary. In our technique, the inner products method,
we specialize $\psi(x) = x$ but allow the test functions $\phi_k$ to be
arbitrary. A key advantage of IP is that for all linear model
functions the resulting equations are linear. Therefore, no
initial guesses for the parameters are needed.

If we restrict the test functions to be the same as in
LS, then the IP method can be restated as a minimization
problem as follows. If we define the residual vector
$r(\theta) = A\theta - Y$ then we can state the general data-fitting
problem as the solution to

$$\min_\theta f(\theta) = \rho(r(\theta)). \tag{5}$$

In the case of least squares, the function $\rho$ is
$\rho(r) = \tfrac{1}{2}\|r\|^2$.
For the IP method, we define $Z$ as a diagonal matrix
with the $Z(i)$ along the diagonal; then the function
$\rho(r) = \tfrac{1}{2}\|r\|_Z^2$, where the weighted norm $\|x\|_Z^2$ is defined as
$x^TZx$.

Augmented  test  functions 

In order to find a better set of test functions than LS
(i.e. one which produces a smaller error), we exploit the
fact that the simple linear case satisfies the differential
equation $f''(x) = 0$. Hence, if the data points $(x_i, y_i)$ are
thought to lie close to a straight line, then the discrete
second derivative, $y_{i+1} - 2y_i + y_{i-1}$, should be close to
zero. Thus if we set $D(i) = y_{i+1} - 2y_i + y_{i-1}$, then $D(i)$
measures how far $y_i$ is from the model function. If $D(i)$
is big, then $y_i$ is far from the line. Since $D(i)$ is not
defined when $i$ is an endpoint, we set $D(p) = D(p + 1)$
and $D(q) = D(q - 1)$, where the data is defined on the
interval $[p, q]$.

We modify the previous test functions $\phi$ using $D(i)$
to mollify the effect of the outliers. Let

$$Z(i) = \frac{1}{cD(i)^2 + 1}$$

where $c$ is a relatively large positive number. Then
$Z(i) \approx 1$ if $y_i$, $y_{i-1}$ and $y_{i+1}$ are “good” points, and
$Z(i) \approx 0$ if $y_i$, $y_{i-1}$ or $y_{i+1}$ is an outlier. Now we replace
$\phi(x)$ by $\phi(x)Z(x)$. Note that the farther the outlier is
from the other data points, the less its influence will be.
However, if several consecutive outliers happen to lie on
a line, their influence will be exaggerated.
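
The weighting can be sketched on a small data set with one wild observation. The code below is illustrative: the function names are hypothetical and the value `c = 100.0` is an arbitrary stand-in for "a relatively large positive number"; it assumes equally spaced data with at least three points.

```python
def second_differences(ys):
    """Discrete second derivative D(i), with endpoints copied from their neighbours."""
    d = [0.0] * len(ys)
    for i in range(1, len(ys) - 1):
        d[i] = ys[i + 1] - 2.0 * ys[i] + ys[i - 1]
    d[0], d[-1] = d[1], d[-2]  # D is undefined at the ends, so reuse the nearest value
    return d


def outlier_weights(ys, c=100.0):
    """Weights Z(i) = 1 / (c * D(i)^2 + 1): near 1 for good points, near 0 near outliers."""
    return [1.0 / (c * d * d + 1.0) for d in second_differences(ys)]


# Points on the line y = x, except for a wild observation at index 3.
ys = [0.0, 1.0, 2.0, 10.0, 4.0, 5.0, 6.0]
weights = outlier_weights(ys)
print([round(w, 5) for w in weights])
```

The outlier and both of its neighbours are down-weighted, which matches the statement that $Z(i) \approx 0$ when any of the three points entering $D(i)$ is wild.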

This method generalizes to higher dimensions. For
purposes of explanation we will restrict our attention to
two dimensions. If the data is equally spaced within a
grid, we replace the discrete second derivative by the
discrete Laplacian. In this case every interior point
$p_0 = (x_{i1}, x_{i2})$ has four immediate neighbors:
$p_1 = (x_{i1} + 1, x_{i2})$, $p_2 = (x_{i1} - 1, x_{i2})$,
$p_3 = (x_{i1}, x_{i2} + 1)$, $p_4 = (x_{i1}, x_{i2} - 1)$. The
four vectors determined by these points, $v_k = p_k - p_0$,
$k = 1, 2, 3, 4$, satisfy the following equation of linear
dependence:

$$v_1 + v_2 + v_3 + v_4 = 0. \tag{6}$$

This equation of dependence is then used in order to
adjust the data as shown above.

In higher dimensions, we will assume initially that
the $x_i$ are arranged in a cubical lattice. In other words,
we shall assume that the $x_i$ are the integral lattice points
in a large cube. We shall say that two such points are
“close neighbors” if their difference is a standard basis
vector or its negative; in other words, their difference equals
$\pm(0, 0, \ldots, 0, 1, 0, \ldots, 0, 0)$. Thus each point has $2m$ close
neighbors. Then if $y_i$ is the response variable at the point
$x_i$, the discrete Laplacian at $x_i$ is the sum of the values
at all the close neighbors minus $2m$ times the value at
the center.

If the $y_i$ all lie on a plane then the discrete Laplacian
will be zero. Thus the discrete Laplacian measures how
far the data is from being perfect. We note that an imperfect
center value will create a much larger Laplacian
than an imperfect neighboring value, since the coefficient of the center
is $2m$ while the neighboring coefficients are all one.
Note that this approach should improve with higher dimensions.
For the boundary points which do not have
$2m$ neighbors one can use fewer neighbors. At the corners,
one might extrapolate or assume that the corners
are outliers.
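
As an illustration of the discrete Laplacian described above, the following Python sketch evaluates it at an interior lattice point and shows the effect of a single perturbed value. The dictionary-based storage, the function name `discrete_laplacian`, and the test plane are assumptions made for this example only.

```python
def discrete_laplacian(values, point):
    """Sum of the 2m close-neighbor values minus 2m times the center value.

    `values` maps integer lattice points (tuples of length m) to responses;
    `point` must be an interior point so that all 2m close neighbors exist.
    """
    m = len(point)
    total = 0.0
    for axis in range(m):
        for step in (-1, 1):
            neighbor = list(point)
            neighbor[axis] += step
            total += values[tuple(neighbor)]
    return total - 2 * m * values[point]


# Hypothetical responses lying exactly on the plane y = 2*x1 + 3*x2 + 1.
grid = {(i, j): 2.0 * i + 3.0 * j + 1.0 for i in range(5) for j in range(5)}
planar = discrete_laplacian(grid, (2, 2))

# Perturb the center value by 10 to mimic an outlier.
grid[(2, 2)] += 10.0
at_outlier = discrete_laplacian(grid, (2, 2))       # center enters with weight -2m
next_to_outlier = discrete_laplacian(grid, (2, 3))  # outlier enters with weight +1
```

Here $m = 2$, so the perturbation changes the Laplacian by four times its size at its own location but only by its own size at each close neighbor.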

If the data is not equally spaced, we do not have
the notion of immediate neighbors and a replacement for
(6) is required. This replacement is obtained as follows:
For every data point $p_0$, we divide the plane into three
regions by drawing three rays emanating from $p_0$ in such
a way that the angle between any two is 120 degrees.
Then choose one point $p_1$ from the first region, $p_2$ from
the second, and $p_3$ from the third (this will be possible
for “most” points). Let $v_i = p_i - p_0$, $i = 1, 2, 3$, and solve
the equation of dependence:

$a_1 v_1 + a_2 v_2 + a_3 v_3 = 0.$

The choice of the $p_i$ implies that all $a_i$ may be taken to be
positive. We normalize by taking $a_1^2 + a_2^2 + a_3^2 = 1$. This
is used as a replacement for (6).
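
The coefficients of such an equation of dependence can be computed directly. The sketch below is one possible way to do it in Python, assuming three planar difference vectors and a unit Euclidean norm for the coefficient vector; the function name and the cross-product approach are choices made for this illustration, not taken from the paper.

```python
import math


def dependence_coefficients(v1, v2, v3):
    """Return (a1, a2, a3) with a1*v1 + a2*v2 + a3*v3 = 0 and unit Euclidean norm.

    A null vector of the 2-by-3 matrix [v1 v2 v3] is the cross product of its
    two rows. The sign is chosen so that a1 > 0; whether a2 and a3 are also
    positive depends on the geometry of the three points.
    """
    row_x = (v1[0], v2[0], v3[0])
    row_y = (v1[1], v2[1], v3[1])
    a = (
        row_x[1] * row_y[2] - row_x[2] * row_y[1],
        row_x[2] * row_y[0] - row_x[0] * row_y[2],
        row_x[0] * row_y[1] - row_x[1] * row_y[0],
    )
    norm = math.sqrt(sum(c * c for c in a))
    sign = 1.0 if a[0] > 0 else -1.0
    return tuple(sign * c / norm for c in a)


# Hypothetical neighbors of p0 = (0, 0), one in each 120-degree region.
vectors = [(1.0, 0.0), (-1.0, 1.0), (-1.0, -1.0)]
a = dependence_coefficients(*vectors)

# For a response that is exactly planar, the same combination of the
# response differences y(p_i) - y(p_0) vanishes.
def plane(p):
    return 3.0 * p[0] - 2.0 * p[1] + 5.0

residual = sum(c * (plane(v) - plane((0.0, 0.0))) for c, v in zip(a, vectors))
```

For these vectors the unnormalized coefficients are proportional to (2, 1, 1), so all three are positive.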


Numerical  Results 

Numerical results indicate that the IP method is more
robust than least squares and compares favorably with
other methods. The inner products method was examined
using data from the literature. Figure 1 shows
calibration data with least squares (LS), least median of
squares (LMS), and inner products (IP) fits. The data
is taken from (Massart 1986). The true relationship was
$y = x$; note that only IP returns the exact slope.

Massart et al. (Massart 1986) cite an application
of robust estimation to calibration. They apply both
LS and LMS to the same data set. If the two lines do
not coincide, then LS is usually pulled by outliers at
the end of the calibration range. As an example they
study the calibration of lead measurements by plasma
emission spectrometry. There are 13 data points, 10 of
which are at the low end of the scale. These low concentration
points determine the slope of the LMS and
the IP line. The LS method fits the high end points.
By comparing the two methods (Figure 2), the model
error caused by curvature of the calibration line is revealed.
Visual inspection alone would not have revealed
this. Massart et al. make the point that although this
method of determining model errors is not a statistical
test, there are no very good alternatives. Analysis of
variance requires repeated measurements which may not
be available. Residual analysis is affected by outliers in
the least squares case. An F-test applied to this data
did not show that a second-order model would be significantly
better. Therefore, like LMS, IP can be useful in
detecting model errors.

Figure 3 shows a two dimensional data set with two
outliers where the data points are not uniformly spaced.
LS does predictably poorly whereas IP returns the parameters
exactly.

Our method extends to nonlinear models as follows:
Let $D$ be a differential operator which annihilates the
model. Then, proceeding as in the linear case, we replace
the test function $\phi$ by $\phi Z$. For example, for an
exponential model, $y = a e^{\lambda x}$, $\ln(y) = \ln(a) + \lambda x$. So
$\frac{d}{dx}\ln(y) = \lambda \Rightarrow \frac{d^2}{dx^2}\ln(y) = 0$, that is, $y\,y'' - (y')^2 = 0$. So we take

$D(i) = D_2(i)\,y_i - D_1(i)^2,$

where $D_1$ is the discrete first derivative, and $D_2$ is the
discrete second derivative. We have also shown [Sturm
1994] how the inner product method can be applied to a
nonlinear model for three dimensional eye movements.

Conclusions 

We have shown that the integral inner products method
is more robust than least squares for simple and multivariate
model functions. It compares favorably to more
robust methods such as LMS and IRLS for outlying response
variables and is computationally simpler. For
linear models, no initial guesses for the parameters are
required. IP extends naturally to nonlinear models as
well.


[Plot omitted; axis label: Concentration]

Figure 1. Calibration Data. True relationship: y = x.
LS: 1.26x - 0.48
IP: 1.0x - 0.37
IRLS: 0.92x + 0.19
LMS: 0.90x + 0.20

Figure 2. Calibration of lead measurements by plasma
emission spectrometry. The difference between LS and
IP shows a possible model error.


     0    10     0    12     0     0     0     0
     8     0     0    11     0    13     0    15
     0     8     9     0    11    12    13     0
     0     7     0     9     0    11    12     0
     5     0     7  (15)     9     0     0     0
     0     5     0     7     0     9  (-3)    11
     0     4     5     0     0     8     0    10
     2     0     0     0     6     0     8     0

Figure 3. Multivariate data, y = x1 + x2. Outliers are shown in parentheses.
LS: 0.76x1 + 1.08x2 + 0.45
IP: x1 + x2


References 

Barrodale, I. and Young, A., 1965, Algorithms for Best
L1 and L∞ Linear Approximations on a Discrete Set,
Numerische Mathematik, 8, pp. 295-306.

Bates, D.M., and Watts, D.G., 1988, Nonlinear Regression
Analysis and its Applications, John Wiley,
New York.

Bloomfield, P. and Steiger, W.L., 1983, Least Absolute
Deviations, Birkhauser, Boston.

Brownlee, K.A., 1960, Statistical Theory and Methodology
in Science and Engineering, John Wiley, New
York.

Claerbout, J.F., and Muir, F., 1973, Robust Modeling
With Erratic Data, Geophysics, 38:5, pp. 826-844.

Clancy, V. J., 1947, , Nature, 159, pp. 339-340.

Cook, R. Dennis, 1977, Detection of Influential Observation
in Linear Regression, Technometrics, 19:1, pp.
15-18.

Daniel, C., and Wood, F.S., 1980, Fitting Equations
to Data, John Wiley, New York.

Hampel, R.H., Ronchetti, E.M., Rousseeuw, P.J., and
Stahel, W.A., 1986, Robust Statistics, John Wiley,
New York.

Holland, P.W. and Welsch, R.E., 1977, Robust regression
using iteratively reweighted least squares, Commun.
Statist., A6, pp. 813-888.

Huber, P.J., 1964, Robust Estimation of a Location Parameter,
Annals Math. Statist., 35, pp. 73-101.

Huber, P.J., 1981, Robust Statistics, John Wiley, New
York.




Jump  and  sharp  cusp  detection  by  wavelets  with 
applications  to  estimation  of  functions  with  jumps 

By  YAZHEN  WANG 

Department  of  Statistics,  University  of  Missouri- Columbia, 

Columbia,  MO  65211,  U.S.A. 


Abstract 

A wavelet method is proposed to detect jumps
and sharp cusps in a function which is observed
with noise. We detect jumps and sharp cusps by
checking if the wavelet transformation of the data
has significantly large absolute values at fine scales.
The theory and fast algorithm for the detection are
established. For estimating a function with jumps,
jump detection is used in construction of a wavelet
estimate of the function. The estimate has better
visual quality than the direct threshold wavelet estimate
(e.g. VisuShrink estimate).

1.  Introduction 

The recently developed theory of wavelets has
drawn much attention from mathematicians,
statisticians and engineers. Orthonormal bases
of compactly supported wavelets have been used
to estimate functions (see Donoho and Johnstone
(1992a, b)). The theory of wavelets permits decomposition
of functions into localized oscillating
components. This provides an ideal tool to study
localized changes such as jumps and sharp cusps in
one dimension as well as several dimensions. This
paper describes only the results for the one-dimensional
case in Wang (1994a).

The detection techniques are applied to estimation
of a function with jumps. In the seminal
work of Donoho (1993a, b), Donoho and Johnstone
(1992a, b, c) and Donoho, Johnstone, Kerkyacharian
and Picard (1993), orthonormal bases of compactly
supported wavelets are introduced to estimate
a function. The estimate is the reconstruction
of the thresholded empirical wavelet coefficients of
the data. This simple estimate enjoys a wide variety
of spatial adaptivity and theoretical optimality
properties. If the function has jumps, however, the estimate
will have an annoying visual appearance: the
reconstruction exhibits many undesirable spurious
oscillations near jump locations. Donoho (1993b)
considered segmented multiresolution analysis to
remedy this drawback. The phenomenon also happens
in digital image compression. Because a digital
image has sharp variations along its edge curves,
compression by directly thresholding wavelet coefficients
results in oscillations near the edge curves.
These oscillations produce so-called Gibbs errors
at the locations of the edge curves, which degrade
the image quality considerably (see Froment and
Mallat (1992)). The reason for the phenomenon is
explained as follows. When the underlying function
has sudden changes such as jumps, the changes
are reflected by “large” wavelet coefficients
at fine scales near the change locations. After
thresholding, only these “large” wavelet coefficients
remain at fine levels, and then artificial oscillations
which resemble the mother wavelet appear in the
reconstruction. The approach here is to divide the support
of the function into several blocks according
to the detected jumps and then use boundary corrected
wavelets to estimate the function on each
of the blocks. The estimate has the visual advantage
that there are no annoying oscillations near
the jumps.

The rest of this paper is organized as follows. Sections
2 and 3 introduce the white noise model and
wavelet transformation, respectively. Testing hypotheses
and estimation are considered in Sections
4 and 5, respectively. Section 6 discusses implementation
of the method in practice and Section 7
features an application of jump detection to estimation
of functions with jumps. Section 8 presents
discussion.

2.  The  white  noise  model 

Suppose $f$ is observed from the white noise model

$Y(dx) = f(x)\,dx + \tau W(dx), \quad x \in [0, 1],$  (1)

where $W$ is a standard Wiener process, $\tau$ is a
formal noise level parameter which we think of as
small, and $f$ is an unknown function which may have
jumps and sharp cusps. The problem is to detect
these jumps and cusps.

We say a function $f$ has an $\alpha$-cusp ($0 \le \alpha < 1$)
at $x_0$ if there exists a positive constant $K$ such that
as $h \downarrow 0$ or $h \uparrow 0$,

$|f(x_0 + h) - f(x_0)| \ge K|h|^{\alpha}.$  (2)

For the case $\alpha = 0$, $f$ has a jump at $x_0$.

The white noise model (1) is closely related to
the following nonparametric regression model:

$y_i = f(x_i) + \sigma z_i, \quad i = 1, \ldots, n,$  (3)

with $x_i = i/n$, $z_i$ a standard normal error, $\sigma > 0$ a
parameter, and $f$ an unknown function. Define the regression
process $\{Y_n(x) : x \in [0, 1]\}$ via $x_0 = 0$,
$Y_n(0) = 0$ and $Y_n(x_i) = (y_1 + \cdots + y_i)/n$, $i = 1, \ldots, n$,
with interpolation between the $x_i$ by a Wiener process
$W$ for $x_i < x < x_{i+1}$. Then $Y_n$ is a white
noise process with the function $f_n(x) = f(x_i)$ for
$x_i \le x < x_{i+1}$ and $\tau = \sigma/\sqrt{n}$ (see Donoho and
Johnstone (1992a)).

3.  Wavelet  transformation 

Let $\psi$ be the Daubechies “mother wavelet” (see Chui
(1992), Daubechies (1992) and Donoho and Johnstone
(1992a, b)), and define $\psi_s(x) = s^{-1/2}\psi(x/s)$.

The wavelet transformation of $f$ is defined as
$Tf(s, x) = \int \psi_s(x - u) f(u)\,du$. The wavelet transformation
$Tf(s, x)$ is a function of the scale (frequency)
$s$ and the spatial position (time) $x$. The
plane defined by the pair of variables $(s, x)$ is called
the scale-space (or time-frequency) plane.

For compactly supported wavelets, the value of
$Tf(s, x)$ depends upon the value of $f$ in a neighborhood
of $x$ of size proportional to the scale $s$. At
small scales, $Tf(s, x)$ provides localized information
such as local regularity on $f(x)$. For example, if $f$ is
differentiable at $x$, $Tf(s, x)$ has the order $s^{3/2}$, and
if $f$ has an $\alpha$-cusp at $x$, the maximum of $Tf(s, x)$
over a neighborhood of $x$ of size proportional to
the scale $s$ converges to zero at a rate no faster than
$s^{\alpha + 1/2}$ as $s$ tends to zero (see Daubechies (1992)).

The wavelet transformation of the white noise
$W(dx)$ is defined to be $TW(s, x) = \int \psi_s(x - u) W(du)$.
The wavelet transformation of $Y$ is

$TY(s, x) = \int \psi_s(x - u) Y(du) = Tf(s, x) + \tau TW(s, x).$  (4)

At a given scale $s$, $TW(s, x)$ is a stationary Gaussian
process with zero mean and covariance function

$E\{TW(s, x)\,TW(s, y)\} = \int \psi_s(x - u)\,\psi_s(y - u)\,du,$  (5)

and

$\mathrm{var}\{TW(s, x)\} = \int [\psi(u)]^2\,du = 1.$

Note that $TW(s, x)$ follows a standard normal
distribution and that the orders of $Tf(s, x)$ are, respectively,
$s^{\alpha + 1/2}$ and $s^{3/2}$ for the two cases that
$f(x)$ has an $\alpha$-cusp at $x$ and $f(x)$ is differentiable
at $x$. By (4) we can see that, at a very fine scale
$s$, $TY(s, x)$ is dominated by $\tau TW(s, x)$, while at a
coarse scale $s$, $Tf(s, x)$ dominates $TY(s, x)$. Since
the localized information of $f(x)$ is provided by
$Tf(s, x)$ at fine scales, if the scale $s$ is too large, the
wavelet transformation cannot detect local changes
with enough precision. Our idea is to select fine
scales $s_\tau$ such that at those $x$ where $f(x)$ is differentiable,
the orders of $Tf(s_\tau, x)$ and $\tau TW(s_\tau, x)$ are
balanced. If $f$ has sharp cusps, for $x$ near the locations
of the sharp cusps, $TY(s, x)$ will be dominated
by $Tf(s, x)$ for $s \ge s_\tau$ and hence significantly larger
than the others. Therefore, the sharp cusps will be
detected by the wavelet transformation $TY(s, x)$ at
the scale levels $s \ge s_\tau$.
Throughout this paper, take $\eta$ to be any constant
that is greater than 1, and let $s_\tau$ be a constant with the
exact order $(\tau^2 |\log \tau|^{\eta})^{1/(2\alpha+1)}$. Denote by $\mathrm{supp}(\psi)$
the support of $\psi$.

4. Testing hypotheses

Consider the testing problem $H_0$: $f$ is differentiable
against $H_1$: $f$ has $\alpha_i$-cusps, $i = 1, \ldots, q$,
$q \ge 1$, where $\alpha_i \le \alpha < 1$.

Our test statistic is the maximum of $|TY(s_\tau, x)|$
over $0 \le x \le 1$. Given a type I error $\gamma$, under $H_0$
the critical value $C_{\tau,\gamma}$ satisfies

$\mathrm{pr}\{\max_{0 \le x \le 1} |TY(s_\tau, x)| > C_{\tau,\gamma}\} = \gamma.$

Theorem 1. If $0 < \gamma < 1$, then under $H_0$,

$\lim_{\tau \to 0} \mathrm{pr}\{\max_{0 \le x \le 1} |TY(s_\tau, x)| > C_{\tau,\gamma}\} = \gamma,$

where

$C_{\tau,\gamma} = \tau (2|\log s_\tau|)^{-1/2} \Big[ 2|\log s_\tau| + \log\Big\{ \Big(\int [\psi'(u)]^2\,du\Big)^{1/2} / (2\pi) \Big\} - \log\{-\log(1 - \gamma)/2\} \Big].$  (6)

The proof of Theorem 1 is given in Wang (1994a).
Theorem 1 provides a test to check if the underlying
function is smooth or has jumps or sharp cusps.

5.  Estimation 

Because of space limitations we discuss only the case that $f$
has one jump or sharp cusp. See Wang (1994) for
multiple jump and sharp cusp detection with known
and unknown number of jumps and sharp cusps.

Suppose $f$ has an $\alpha$-cusp at $\theta$ and is differentiable
elsewhere. An estimate of $\theta$ is the location of the
maximum of $|TY(s_\tau, x)|$ over $0 \le x \le 1$, that is

$\hat\theta = \arg\max_{0 \le x \le 1} \{|TY(s_\tau, x)|\}.$  (7)

Theorem 2.

$\lim_{\tau \to 0} \mathrm{pr}\{ s_\tau^{-1}(\hat\theta - \theta) \in \mathrm{supp}(\psi) \} = 1.$

The compact support of $\psi$ implies that the estimate
$\hat\theta$ has the convergence rate $s_\tau$.

Moreover, suppose that $f(x) = f(\theta) + A_1|x - \theta|^{\alpha} + o(|x - \theta|^{\alpha})$
as $x \to \theta$ if $\alpha > 0$, and $f(\theta+) - f(\theta-) = A_2$
if $\alpha = 0$, where $A_i \ne 0$, $i = 1, 2$. Then, as
$\tau \to 0$, $(\hat\theta - \theta)/s_\tau$ converges in probability to the
location of the maximum of $\{|\int \psi(u - t)\,|u|^{\alpha}\,du| : t \in \mathrm{supp}(\psi)\}$
if $\alpha > 0$, and of $\{|\int \psi(u - t)\,\mathrm{sign}(u)\,du| : t \in \mathrm{supp}(\psi)\}$
if $\alpha = 0$.

The proof of Theorem 2 is given in Wang (1994a).
Theorem 2 establishes asymptotics for the detection.
Since jump detection has been studied in the
nonparametric regression setting, we compare the
convergence rate with those in the literature. By
Theorem 2, for the white noise model and the jump
point case, $\alpha = 0$, the convergence rate is $\tau^2 |\log \tau|^{\eta}$.
Using the relation between the models (1) and (3)
described in Section 2 and letting $\tau = \sigma/\sqrt{n}$, we find
that this rate corresponds to the rate $n^{-1}(\log n)^{\eta}$
for the nonparametric regression model, which is
known to be the best possible convergence rate (see
Muller (1992)). So the convergence rate $s_\tau$ is the
best possible rate and the wavelet method is theoretically
optimal.

6.  Implementation  in  practice 

In practice, we may not have a realization of $Y(x)$
at all points $x$ but discrete observations $Y(i/n)$,
$i = 1, \ldots, n = 2^J$. Or equivalently, we observe $f$
from the model (3), that is, $y_i = f(i/n) + \sigma z_i$, $\sigma > 0$,
$z_i \sim N(0, 1)$, $i = 1, \ldots, n = 2^J$. The data $y_1, \ldots, y_n$
are discrete and consequently a discrete version of
the wavelet transformation must be performed.

The discrete wavelet transformation (DWT) can
be written as an $n$-by-$n$ orthogonal matrix $W$
which depends on parameters $M$ (number of vanishing
moments), $S$ (support width), $j_0$ (low-resolution
cutoff), and boundary adjustments (see
Cohen, Daubechies, Jawerth, and Vial (1993) and
Daubechies (1994)). The rows of $W$ correspond to
discretized versions of the wavelets $\psi_s$. Denote by
$W_{jk}(i)$ the $i$th element of the $(j, k)$th row of $W$.
Then $\sqrt{n}\,W_{jk}(i) \approx 2^{j/2}\psi(2^j t)$ for $t = i/n - k2^{-j}$
(see Donoho and Johnstone (1992b)).

Let $y = (y_1, \ldots, y_n)$. The DWT of the data $y$ is
given by $w = Wy$. Because $W$ is orthogonal, the
inverse DWT is easy and $y$ is recovered from $w$,
that is, $y = W^T w$. Mallat’s pyramidal algorithm
(see Chui (1992) and Daubechies (1992)) requires
only $O(n)$ operations for computing the DWT and
for reconstructing the data from the DWT.
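
To make the pyramid idea concrete, here is a minimal Python sketch of an orthogonal DWT and its inverse. It assumes the Haar wavelet (the simplest Daubechies wavelet, with one vanishing moment) and data of length $n = 2^J$, so no boundary adjustment is needed; the function names `haar_dwt` and `haar_idwt` are hypothetical and are not the implementation used in the paper.

```python
import math


def haar_dwt(y):
    """Orthogonal Haar DWT by the pyramid scheme; len(y) must be a power of two.

    Returns (w_coarse, levels): w_coarse plays the role of w_{-1,0} and
    levels[j] holds the 2**j detail coefficients w_{j,k}, coarse to fine.
    """
    n = len(y)
    if n < 1 or (n & (n - 1)) != 0:
        raise ValueError("length must be a power of two")
    root2 = math.sqrt(2.0)
    approx = list(y)
    details = []  # collected finest level first
    while len(approx) > 1:
        half = len(approx) // 2
        smooth = [(approx[2 * k] + approx[2 * k + 1]) / root2 for k in range(half)]
        detail = [(approx[2 * k] - approx[2 * k + 1]) / root2 for k in range(half)]
        details.append(detail)
        approx = smooth
    return approx[0], details[::-1]


def haar_idwt(w_coarse, levels):
    """Inverse of haar_dwt: rebuild the data from coarse to fine."""
    root2 = math.sqrt(2.0)
    approx = [w_coarse]
    for detail in levels:
        finer = []
        for s, d in zip(approx, detail):
            finer.append((s + d) / root2)
            finer.append((s - d) / root2)
        approx = finer
    return approx


# Hypothetical data of length n = 8 (J = 3).
y = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
w_coarse, levels = haar_dwt(y)
y_back = haar_idwt(w_coarse, levels)
energy_y = sum(v * v for v in y)
energy_w = w_coarse ** 2 + sum(d * d for level in levels for d in level)
```

Each pass halves the number of values still to be processed, so the total work is proportional to $n/2 + n/4 + \cdots$, which is $O(n)$. Orthogonality shows up as exact reconstruction and equal sums of squares before and after the transform.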

Using Mallat’s pyramid algorithm, we can compute
the discrete wavelet transformation of the data
$(y_i)$. The elements of $w$ are indexed dyadically
as follows: $w_{-1,0}$ and $w_{j,k}$, $k = 0, \ldots, 2^j - 1$,
$j = 0, \ldots, J - 1$. The $w_{j,k}$ correspond to the DWT at
the scale levels $2^{-j}$ and spatial positions $k2^{-j}$,
$k = 0, \ldots, 2^j - 1$, $j = 0, \ldots, J - 1$. These discrete
wavelet transformations are also called empirical
wavelet coefficients.

In  practice,  for  a  given  data  set,  we  can  use  Mal¬ 
lat’s  pyramid  algorithm  to  compute  the  wavelet 
coefficients  and  then  carry  out  the  detection  (see 
Wang  (1994a)  for  simulations  and  real  examples). 
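
The detection step can be sketched in a few lines of Python. The example below is only an illustration under strong simplifying assumptions: it uses a discretized Haar wavelet instead of a general Daubechies wavelet, a hand-picked scale of `m` samples instead of $s_\tau$, a single jump, and simulated data; it locates the jump at the maximum of the absolute transform, in the spirit of estimate (7), and does not implement the test based on the critical value.

```python
import math
import random


def haar_transform_at_scale(y, m):
    """Discretized Haar wavelet transform of y with m samples per half-window.

    T[i] = (sum(y[i:i+m]) - sum(y[i-m:i])) / sqrt(2*m), so unit-variance white
    noise in y gives unit-variance noise in T. Positions without a full window
    on both sides are left as None.
    """
    n = len(y)
    T = [None] * n
    for i in range(m, n - m + 1):
        right = sum(y[i:i + m])
        left = sum(y[i - m:i])
        T[i] = (right - left) / math.sqrt(2 * m)
    return T


def estimate_jump_index(y, m):
    """Index i maximizing |T[i]|; the jump is placed between y[i-1] and y[i]."""
    T = haar_transform_at_scale(y, m)
    candidates = [i for i in range(len(T)) if T[i] is not None]
    return max(candidates, key=lambda i: abs(T[i]))


# Simulated data (an assumption of this sketch): f(x) = x plus a jump of
# size 2 at x = 0.5, observed at x_i = i/n with noise level sigma = 0.1.
rng = random.Random(0)
n = 256
y = [i / n + (2.0 if i >= n // 2 else 0.0) + 0.1 * rng.gauss(0.0, 1.0)
     for i in range(n)]
i_hat = estimate_jump_index(y, m=8)
theta_hat = i_hat / n
```

With a jump this large relative to the noise, the maximum of the absolute transform falls at or next to the true jump location.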

7. An application to estimation of a function with jumps

Suppose we are given $n$ noisy observations of $f$ from the
regression model (3). The unknown function $f$ has
jumps and we want to recover $f$. The direct threshold
estimate (e.g. VisuShrink estimate) will have
many undesirable spurious oscillations near jump
locations (see Donoho and Johnstone (1992b, c)).
The approach here is to divide $[0, 1]$ into several
blocks according to the detected jumps and then use
boundary corrected wavelets to estimate the function
on each of the blocks.

Suppose $f$ has $q$ jumps at $\theta_\ell$, $\ell = 1, \ldots, q$, where $q$
is a finite integer. For simplicity, suppose $q$ is known
(see Wang (1994a) for the case that $q$ is unknown).
Let $\hat\theta_1, \ldots, \hat\theta_q$ be the estimated locations of the
jumps and let $I_\ell = [\hat\theta_\ell + K \log n / n,\ \hat\theta_{\ell+1} - K \log n / n]$,
$\ell = 1, \ldots, q$.

The data are divided into $q$ blocks: $y_\ell = \{y_i : x_i \in I_\ell\}$,
$\ell = 1, \ldots, q$. On each of the $I_\ell$,
$\ell = 1, \ldots, q$, using boundary corrected wavelets, we
compute the wavelet coefficients of $y_\ell$. Because of
the partition, no “large” wavelet coefficients of $y_\ell$
will appear at fine levels. On each of the $I_\ell$, we apply
the VisuShrink method (see Donoho and Johnstone
(1992b)) to the data $y_\ell$ to obtain an estimate
$\{\hat f_\ell(x_i) : x_i \in I_\ell\}$ of the function $\{f(x_i) : x_i \in I_\ell\}$.
The estimate $\hat f_n$ of $f$ on $[0, 1]$ is obtained by pasting
the $\{\hat f_\ell(x_i) : x_i \in I_\ell\}$ together. The visual advantage of
the approach is that $\hat f_n$ has no annoying oscillations
near the jumps (see Wang (1994a) for simulations
and real examples).

8.  Discussion 

It is important to point out that detection
by wavelets is multiresolution in nature.
In the detection we check empirical wavelet coefficients
across resolution levels and locate jumps
and sharp cusps by empirical wavelet coefficients
at these levels. Like smoothing methods with variable
smoothing parameters, the multiresolution approach
has spatial adaptivity. Detection with
spatial adaptivity has many advantages over traditional
methods such as kernel methods with fixed
bandwidth. For example, a jump is easier to detect
than a sharp cusp; because of multiresolution,
for fixed sample size and fixed signal-to-noise ratio,
the jump can be located more accurately by
wavelet coefficients in higher resolution levels while
the cusp is detected by wavelet coefficients in lower
resolution levels.

In summary, detection by wavelets enjoys theoretical
optimality, has fast computational algorithms,
and can be easily implemented in practice.
1 Abstract

Fitting distributions to data has many applications. First, most
analyses make assumptions about the distribution of the data or
the residuals, and these assumptions should be checked whenever
possible. Second, inferences about the fraction of observations
above or below some particular point are often needed, and
modeling the distribution may be the best way to obtain these
estimates. Third, biological models often require the parameter
estimates from a distribution. There are several numerical
methods used in distribution fitting, and this paper will discuss
three: the standard Newton-Raphson method, a modified
Gauss-Newton method, and the EM Algorithm.


2 Introduction

This paper restricts its discussion to maximum likelihood
techniques. Distributions with range parameters to be estimated
are not considered because of the difficulties in estimating these
parameters using maximum likelihood methods. Several different
numerical methods are commonly used to derive estimates,
and this paper will discuss three of these: the standard
Newton-Raphson method, a modified Gauss-Newton method,
and the EM Algorithm. The performance of these methods for
problems of distribution fitting will be discussed.

Even with these restrictions, there are several problems in
distribution fitting which are challenging. These include
grouping and censoring of data, truncation of distributions, and
mixtures of distributions. Truncation differs from censoring in
that the values below (or above) some cutoff point are never
seen. For example, lengths of fish gathered by net will not
include fish below a certain size since those fish pass through the
net. The number of fish below this size is never known. With
censored data, information is available on the number below the
limit. Censoring is a common problem with environmental
pollutants which often exist in concentrations below the minimum
detectable limit of the measuring instrument (left censoring).
Censoring also occurs in time-to-tumor studies where
some animals die before they get a tumor (right censoring).
Grouping occurs when the number of observations is so large
that it is impractical to retain the individual values.


3 Newton-Raphson Method

The Newton-Raphson method is described in every
book on numerical analysis (e.g., see Burden and
Faires, 1985). Unfortunately, it performs rather poorly
for many distributional fitting problems. However, one
particular problem for which it is very useful is
estimating the inverse of a cumulative distribution
function, $F$. Assume that we want to solve for $x$ given
the probability $p$ and the parameters $\alpha$ and $\beta$:

$x = F^{-1}(p, \alpha, \beta)$

or equivalently $F(x, \alpha, \beta) - p = 0$.

The Newton-Raphson iteration scheme becomes

$x_{k+1} = x_k - \frac{F(x_k, \alpha, \beta) - p}{f(x_k, \alpha, \beta)},$

where $x_k$ denotes the $k$th estimate of $x$ and $f$ is the
probability density function.

As an example, consider the problem of evaluating
the inverse incomplete beta function. The incomplete
beta function is given by

$F(x, \alpha, \beta) = \int_0^x \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, t^{\alpha - 1}(1 - t)^{\beta - 1}\,dt,$

where $\Gamma(\alpha)$ is the gamma function. The problem is to
solve for $x$ when $p$ is specified. Assume $p = 0.025$, $\alpha = 3.5$,
and $\beta = 12.5$. Using the approximate formula
in Abramowitz and Stegun (p. 945) for an initial
estimate, the following are the estimates by iteration.

Iteration    x          F(x, α, β) - p
0            0.06643    0.0090519
1            0.06033    0.0006947
2            0.05978    0.0000056
3            0.05977    0.0000000

This is the standard quadratic convergence expected
from the Newton-Raphson method.
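
The iteration above is easy to reproduce. The following Python sketch is a minimal illustration, not the author's program: it assumes $\alpha > 1$ and $\beta > 1$, evaluates the incomplete beta function by composite Simpson integration (a choice made here to keep the example self-contained), and starts from the initial estimate 0.06643 quoted in the table. All function names are hypothetical.

```python
import math


def beta_pdf(t, a, b):
    """Beta density; returns 0 outside (0, 1), which is correct for a, b > 1."""
    if t <= 0.0 or t >= 1.0:
        return 0.0
    log_const = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_const + (a - 1.0) * math.log(t) + (b - 1.0) * math.log1p(-t))


def beta_cdf(x, a, b, n=2000):
    """Incomplete beta function on (0, x) by composite Simpson's rule (n even)."""
    h = x / n
    total = beta_pdf(0.0, a, b) + beta_pdf(x, a, b)
    for i in range(1, n):
        total += (4.0 if i % 2 else 2.0) * beta_pdf(i * h, a, b)
    return total * h / 3.0


def inverse_beta_cdf(p, a, b, x0, tol=1e-12, max_iter=50):
    """Newton-Raphson: x <- x - (F(x) - p) / f(x), starting from x0."""
    x = x0
    for _ in range(max_iter):
        error = beta_cdf(x, a, b) - p
        if abs(error) < tol:
            break
        x -= error / beta_pdf(x, a, b)
    return x


# Values from the example in the text: p = 0.025, alpha = 3.5, beta = 12.5.
x_hat = inverse_beta_cdf(0.025, 3.5, 12.5, x0=0.06643)
```

The numerical integration is the expensive part of each step; a library routine for the incomplete beta function could be substituted for `beta_cdf` without changing the Newton update.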
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4  Modified  Gauss-Newton  Method  /(jc)  =  where  p  >  o. 

A  very  good  general  method  for  obtaining  a  maximum  likelihood  It  is  often  derived  from  a  differential  equation.  The 

estimates  of  the  parameters  of  a  distribution  is  a  modified  cumulative  logistic  distribution  is  defined  as 

Gauss-Newton  method.  This  method  is  normally  applied  to  least  =  1/[1  + 

squares  estimation,  but  it  has  some  advantages  for  maximum 

likelihood  estimation.  There  appear  to  be  no  references  which  It  is  commonly  used  in  dose-response  analysis,  but  the 

describe  the  procedure  given  below  exactly  —  the  closest  is  that  tise  of  the  logistic  distribution  for  distribution  fitting  is 

of  Bemdt  et  al.  (1974).  The  following  notation  will  be  used  in  less  common.  The  distribution  is  similar  in  shape  to 

the  estimation  formulas.  Let  the  i*  observation  be  denoted  by  normal  distribution,  but  has  more  mass  in  the  tails. 

In  general,  the  log-likelihood  function  for  continuous  data  can  be  Thus  it  may  be  a  good  alternative  to  the  normal  distri- 
written  as  bution  in  situations  where  the  data  are  symmetric,  but 

H  the  tails  are  heavier. 

L  =  Log(f^) ,  In  order  to  use  the  modified  Gauss-Newton  method, 

<  *  1  the  required  partial  derivatives  are: 

where/  is  the  density  function  for  the  distribution  evaluated  at  I  (l  -  ^ 

fM-  If  a  single  distribution  is  truncated,  then /  is  given  by  “  “p  |  j  ’ 


fi  -mtlFiB)  -  F(A)],  where 

A  is  the  left  truncation  point,  B  is  the  right  truncation  point,  and 
f(Xi)  is  the  complete  (not  truncated)  distribution  evaluated  at 
and  F(x)  is  the  cumulative  distribution  function.  The  first  partial 
derivative  can  be  written  as 

3f(g)  _  dFjA) 

glog(/p  _  ee  _  36  36 

36  ■  fix)  ■  FiB)  -  F(4) 

The  method  involves  estimating  the  expected  values  of  the 
second  partial  derivatives  using  first  partial  derivatives.  Using 
the  results  of  Cram6r  (1946,  p.  502),  we  know  that 

J5[VLf  =  -£[V(Viy]. 

where v is the gradient function. This suggests replacing the
matrix of second partial derivatives with the expected value of
the first partial derivatives squared. The calculation of the
expected value is difficult, but we can finesse this problem by
using the observed sample distribution instead of taking
expectations. Asymptotically, the distribution of our sample will
approach the true (but unknown) distribution. Furthermore, the
sum of cross products of the first partial derivatives will be
guaranteed to be positive definite. The estimated matrix is
inverted and used in the standard Newton formulation. For
some problems, the change in estimates may be too large,
resulting in a decrease rather than an increase in the likelihood.
In those cases, it is necessary to successively chop the change in
half until the likelihood is increased. Iteration is stopped when
the change in the parameter values is arbitrarily small.

To demonstrate the method, consider the truncated logistic
distribution (also known as the sech-squared distribution). One
form of the density is given by

$$f(x) = \frac{e^{-(x-\alpha)/\beta}}{\beta\,[1 + e^{-(x-\alpha)/\beta}]^{2}},$$

with distribution function $F(x)$. The partial derivatives needed include

$$\frac{\partial f(x)}{\partial \alpha} = \frac{1 - e^{-(x-\alpha)/\beta}}{\beta\,(1 + e^{-(x-\alpha)/\beta})}\, f(x),$$

$$\frac{\partial F}{\partial \alpha} = -f(x) \quad \text{and} \quad \frac{\partial F}{\partial \beta} = \frac{-(x - \alpha) f(x)}{\beta}.$$

The asymptotic covariance matrix can also be calculated
from the same inverse matrix of partial derivatives.

To illustrate the method, consider the data of
Kenyon, Scheffer, and Chapman (1954) on the Pribilof
fur-seal herd in Alaska. The herd has been on the
verge of extermination several times. During the
commercial sealing season, male seals whose length (tip
of snout to base of tail) is between 41 and 45 inches (to
the nearest inch) are killed (clubbed to death). Note
that the data are naturally truncated at 40.5 and 45.5
inches (see Figure 1). The estimates of alpha, beta,
and the log-likelihood by iteration are in the following
table.

Iteration    alpha     beta     log-likelihood
    0        42.328    0.8398     -651.3232
    1        41.960    0.9558     -587.3418
    2        41.835    0.9580     -586.6463
    3        41.846    0.9483     -586.6409
    4        41.843    0.9497     -586.6407
    5        41.844    0.9494     -586.6407

Note the rapid convergence, comparable to standard
quadratic methods. The actual fit is shown in Figure 1.
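The iteration described above (outer product of first derivatives in place of the Hessian, with step halving) can be sketched as follows. This is a minimal illustration, not the author's software: the function names, the simulated data, the starting values, and the use of central finite differences for the per-observation scores are all assumptions made here for brevity.

```python
import math
import random

def sigmoid(v):
    """Numerically stable logistic distribution function."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)

def softplus(v):
    """Stable log(1 + exp(v))."""
    return v + math.log1p(math.exp(-v)) if v > 0 else math.log1p(math.exp(v))

def log_density(x, alpha, beta, lo, hi):
    """Log density of a logistic(alpha, beta) truncated to [lo, hi]."""
    if beta <= 0:
        return float("-inf")
    mass = sigmoid((hi - alpha) / beta) - sigmoid((lo - alpha) / beta)
    if mass <= 0:
        return float("-inf")
    u = (x - alpha) / beta
    return -u - math.log(beta) - 2.0 * softplus(-u) - math.log(mass)

def loglik(data, alpha, beta, lo, hi):
    return sum(log_density(x, alpha, beta, lo, hi) for x in data)

def score(x, alpha, beta, lo, hi, eps=1e-5):
    """Central-difference gradient of one observation's log density (assumption)."""
    ga = (log_density(x, alpha + eps, beta, lo, hi)
          - log_density(x, alpha - eps, beta, lo, hi)) / (2 * eps)
    gb = (log_density(x, alpha, beta + eps, lo, hi)
          - log_density(x, alpha, beta - eps, lo, hi)) / (2 * eps)
    return ga, gb

def fit(data, lo, hi, alpha, beta, tol=1e-6, max_iter=50):
    ll = loglik(data, alpha, beta, lo, hi)
    history = [ll]
    for _ in range(max_iter):
        scores = [score(x, alpha, beta, lo, hi) for x in data]
        # Sum of cross products of first derivatives replaces the Hessian.
        a11 = sum(ga * ga for ga, gb in scores)
        a12 = sum(ga * gb for ga, gb in scores)
        a22 = sum(gb * gb for ga, gb in scores)
        b1 = sum(ga for ga, gb in scores)
        b2 = sum(gb for ga, gb in scores)
        det = a11 * a22 - a12 * a12
        da = (a22 * b1 - a12 * b2) / det
        db = (a11 * b2 - a12 * b1) / det
        step = 1.0
        while step > 1e-8:  # chop the change in half until the likelihood increases
            new_ll = loglik(data, alpha + step * da, beta + step * db, lo, hi)
            if new_ll > ll:
                break
            step /= 2.0
        else:
            break  # no improving step was found
        alpha, beta, ll = alpha + step * da, beta + step * db, new_ll
        history.append(ll)
        if abs(step * da) < tol and abs(step * db) < tol:
            break
    return alpha, beta, history

# Simulated data (hypothetical; the seal measurements are not reproduced here).
random.seed(1)
LO, HI = 40.5, 45.5
TRUE_ALPHA, TRUE_BETA = 41.8, 0.95
f_lo = sigmoid((LO - TRUE_ALPHA) / TRUE_BETA)
f_hi = sigmoid((HI - TRUE_ALPHA) / TRUE_BETA)
data = []
for _ in range(500):
    u = random.uniform(f_lo, f_hi)
    data.append(TRUE_ALPHA + TRUE_BETA * math.log(u / (1.0 - u)))

mean = sum(data) / len(data)
sd = math.sqrt(sum((x - mean) ** 2 for x in data) / len(data))
alpha_hat, beta_hat, history = fit(data, LO, HI, mean, sd * math.sqrt(3) / math.pi)
print(alpha_hat, beta_hat, history[-1])
```

Because a step is accepted only when the log-likelihood rises, `history` is increasing by construction; the finite-difference scores could be swapped for the analytic derivatives without changing the structure of the loop.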




Figure 1. Fit of a truncated logistic distribution to lengths of
fur seals in the Pribilof Islands.


5  EM  Algorithm 

Another  method  for  estimating  the  parameters  of  a  distribution 
is  the  EM  algorithm  as  described  by  Dempster,  Laird,  and  Rubin 
(1977).  This  method  works  well  for  the  problems  created  by 
mixtures  and  grouping,  and  can  be  described  as  follows 

E  step:  estimate  the  expected  values  of  the  sufficient  statistics  for 
the  missing  data  using  the  current  estimates  of  the  parameters. 

M step: recompute the estimates of the parameters from the
expected values of the sufficient statistics. Iterate by returning
to the E step until the desired accuracy is attained.

Grouped data are common and yet standard Sheppard
correction estimates can be quite biased. As an example,
consider the grouped blood lead data shown in Figure 2. The
city of New York conducted a blood lead screening program in
children for several years to prevent lead poisoning from paint
chips (Billick et al., 1970). A detailed discussion of the analysis
of this data set was given by Hasselblad et al. (1980). Because
the data were collected for screening purposes, the actual blood
lead values (in micrograms per deciliter) were not recorded, but
were grouped into ten-unit intervals (except for the first
interval). The data from the years 1970 to 1973 showed both larger
means and larger standard deviations than did data from later
years. The larger means were undoubtedly due to the lead in air
and dust resulting from the high lead content in gasoline. The
larger standard deviations were probably due to the seasonal
changes in automobile travel and outdoor activities of the
children. After most lead was removed from gasoline, the
means of both the blood lead and air lead levels dropped.


Figure  2  shows  the  data  for  1974. 


Figure  2.  Fit  of  a  grouped  lognormal  distribution  to 
blood  lead  levels  in  black  children  aged  1-3  in  New 
York  City  in  1974. 


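In order to use the EM algorithm, the expected
values of $x$ and $x^2$, assuming a normal (after
transformation) distribution, must be calculated for each interval.
If the endpoints of the intervals are denoted as $c_i$, then
these expectations are given by

$$E(x \mid c_{i-1} < x < c_i) = \mu - \sigma z_{1i}$$

$$E(x^2 \mid c_{i-1} < x < c_i) = \sigma^2 (1 + z_{2i}) + \mu^2 - 2\mu z_{1i}\sigma$$

where, writing $a_i = (c_i - \mu)/\sigma$ for the standardized endpoints and
$f$ and $F$ for the standard normal density and distribution function,

$$z_{1i} = \frac{f(a_i) - f(a_{i-1})}{F(a_i) - F(a_{i-1})} \quad \text{and} \quad z_{2i} = \frac{a_{i-1} f(a_{i-1}) - a_i f(a_i)}{F(a_i) - F(a_{i-1})}.$$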
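The E and M steps for grouped normal data can be sketched as below, using the conditional moments of a normal variate restricted to an interval. This is only an illustration: the interval endpoints, the counts, and the starting values are hypothetical (the screening counts are not reproduced here), and the function names are ours.

```python
import math

def phi(t):
    """Standard normal density (zero at infinite arguments)."""
    return 0.0 if math.isinf(t) else math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def Phi(t):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def t_phi(t):
    """t * phi(t), defined as zero at infinite arguments."""
    return 0.0 if math.isinf(t) else t * phi(t)

def grouped_loglik(edges, counts, mu, sigma):
    total = 0.0
    for lo, hi, cnt in zip(edges, edges[1:], counts):
        total += cnt * math.log(Phi((hi - mu) / sigma) - Phi((lo - mu) / sigma))
    return total

def em_step(edges, counts, mu, sigma):
    n = sum(counts)
    s1 = s2 = 0.0
    for lo, hi, cnt in zip(edges, edges[1:], counts):
        a, b = (lo - mu) / sigma, (hi - mu) / sigma
        mass = Phi(b) - Phi(a)
        z1 = (phi(b) - phi(a)) / mass
        z2 = (t_phi(a) - t_phi(b)) / mass
        # E step: conditional first and second moments within the interval.
        ex = mu - sigma * z1
        ex2 = sigma ** 2 * (1.0 + z2) + mu ** 2 - 2.0 * mu * sigma * z1
        s1 += cnt * ex
        s2 += cnt * ex2
    # M step: moment estimates from the expected sufficient statistics.
    new_mu = s1 / n
    new_sigma = math.sqrt(s2 / n - new_mu ** 2)
    return new_mu, new_sigma

# Hypothetical grouped data on the log scale (endpoints are logs of 15, 25, ... ug/dl).
inf = float("inf")
edges = [-inf, math.log(15), math.log(25), math.log(35), math.log(45), math.log(55), inf]
counts = [120, 480, 290, 80, 22, 8]

mu, sigma = 3.0, 0.6  # arbitrary starting values
trace = [grouped_loglik(edges, counts, mu, sigma)]
for _ in range(30):
    mu, sigma = em_step(edges, counts, mu, sigma)
    trace.append(grouped_loglik(edges, counts, mu, sigma))
print(mu, sigma, trace[-1])
```

The log-likelihood recorded in `trace` should never decrease from one iteration to the next, which is the defining property of EM and a convenient check on the moment formulas.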

This  same  technique  can  be  used  to  fit  a  linear  model 
to  grouped  data  (see  Hasselblad  et  al.,  1980).  For 
grouped  data  problems,  the  EM  algorithm  converges 
quite  quickly,  and  the  estimates  and  log-likelihood  by 
iteration  are  shown  in  the  following  table. 


Iteration     mu       sigma    log-likelihood
    0        3.1629    0.4823    -3337.0311
    1        3.2033    0.4118    -3265.2098
    2        3.2121    0.3968    -3260.8766
    3        3.2143    0.3934    -3260.6353
    4        3.2148    0.3926    -3260.6220
    5        3.2149    0.3924    -3260.6212
    6        3.2150    0.3924    -3260.6212

A more difficult problem is the estimation of parameters from
mixtures  of  distributions.  The  algorithm  is  quite  simple  and  was 
given  by  Hasselblad  (1966).  Part  of  the  numerical  problem 
results  from  the  likelihood  function  itself.  For  example,  Ali  and 
Giaccotto  (1982)  show  data  for  the  logarithm  of  the  ratio  of 
consecutive  monthly  prices  of  Bethlehem  Steel  stock  over  a  ten 
year  period  (shown  in  Figure  3). 


Figure 3. Fit of a mixture of two normal distributions to the
logarithm  of  the  ratio  of  consecutive  monthly  prices  of 
Bethlehem  Steel  stock. 

Because  the  data  are  already  log-transformed  and  show  two 
peaks,  it  is  logical  to  fit  a  mixture  of  two  normal  distributions  to 
the data. If we look at a graph of the log-likelihood function as
a function of $\mu_2$ and $\sigma_2$ with $\mu_1$ and $\sigma_1$ fixed at their
maximum likelihood estimates, we get the graph shown in Figure 4.


Figure 4. Graph of the log-likelihood function as a function
of $\mu_2$ and $\sigma_2$ with $\mu_1$ and $\sigma_1$ fixed at their maximum
likelihood estimates for monthly prices of Bethlehem Steel
stock.


The spikes on the left actually go to infinity wherever $\mu_2$
takes on the value of any data point and $\sigma_2 \to 0$, a fact
which was reported by Day (1969). Thus we can only
hope to find a relative maximum of the likelihood
function. Furthermore, the likelihood function can be
very flat over regions near the solution (see the right
hand part of Figure 4, which is close to the maximum
likelihood solution).

This is a difficulty for any method, but the EM
algorithm will continue to move to a relative maximum
of the likelihood without any step size correction even
under these difficult conditions. A graph of the
likelihood as a function of the number of iterations for the
Bethlehem Steel stock data is shown in Figure 5. Note
that once the estimates get reasonably close (after 1960
iterations), they zoom in quite quickly to the solution.
The fitted curve was shown in Figure 3.
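For readers who want to see the mixture iteration concretely, here is a minimal sketch of EM for a mixture of two normal distributions. The simulated data, the starting values, and the function names are assumptions for illustration; the stock price data are not reproduced, and no safeguard against the degenerate spikes described above is included.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def em_two_normals(data, p, mu1, s1, mu2, s2, n_iter=100):
    """Plain EM for p*N(mu1, s1) + (1-p)*N(mu2, s2); returns estimates and log-likelihood trace."""
    n = len(data)
    trace = []
    for _ in range(n_iter):
        # E step: posterior probability that each point came from component 1.
        d1 = [p * normal_pdf(x, mu1, s1) for x in data]
        d2 = [(1.0 - p) * normal_pdf(x, mu2, s2) for x in data]
        trace.append(sum(math.log(a + b) for a, b in zip(d1, d2)))
        w = [a / (a + b) for a, b in zip(d1, d2)]
        # M step: weighted moment estimates.
        n1 = sum(w)
        n2 = n - n1
        mu1 = sum(wi * x for wi, x in zip(w, data)) / n1
        mu2 = sum((1.0 - wi) * x for wi, x in zip(w, data)) / n2
        s1 = math.sqrt(sum(wi * (x - mu1) ** 2 for wi, x in zip(w, data)) / n1)
        s2 = math.sqrt(sum((1.0 - wi) * (x - mu2) ** 2 for wi, x in zip(w, data)) / n2)
        p = n1 / n
    return (p, mu1, s1, mu2, s2), trace

# Simulated two-component data (hypothetical parameter values).
random.seed(2)
data = [random.gauss(0.0, 1.0) for _ in range(200)]
data += [random.gauss(4.0, 0.7) for _ in range(100)]

estimates, trace = em_two_normals(data, 0.5, -1.0, 1.0, 3.0, 1.0)
print(estimates, trace[-1])
```

No step-size correction appears anywhere in the loop, yet the recorded log-likelihood never decreases; with poorly separated components the same loop can need thousands of iterations, as the text notes.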


Figure  5.  Graph  of  the  log-likelihood  function  as  a 
function  of  the  number  of  iterations  for  monthly 
prices  of  Bethlehem  Steel  stock. 

As  indicated  earlier,  the  blood  lead  data  of  New 
York  City  in  the  years  prior  to  1974  tended  to  have 
extra  variation.  For  the  data  of  1972,  the  fit  to  a 
mixture  of  two  lognormal  distributions  is  shown  in 
Figure  6.  This  is  an  application  of  the  EM  algorithm 
to  solve  both  the  problems  of  a  mixture  as  well  as 
grouping  in  the  same  example.  Although  there  is  no 
visual  evidence  of  a  mixture,  the  likelihood  ratio  test 
for  the  improvement  in  fit  over  a  single  lognormal 
gives  a  chi-square  of  65.914  for  three  degrees  of 
freedom  (p  <  0.00001).  The  existence  of  the  two 
distributions  may  be  an  artifact  of  the  seasonal  variation 
in  lead  exposure  which  was  not  included  as  a  covariate 
in  the  analysis. 
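The quoted significance level can be checked directly, since the upper tail of a chi-square variate with three degrees of freedom has a closed form in terms of the complementary error function. The function name below is ours.

```python
import math

def chi2_sf_3df(x):
    """Upper tail probability of a chi-square variate with 3 degrees of freedom."""
    return (math.erfc(math.sqrt(x / 2.0))
            + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0))

p_value = chi2_sf_3df(65.914)
print(p_value)
```

The resulting p-value is far below the 0.00001 threshold stated in the text.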




As  might  be  expected,  the  convergence  to  the  maximum 
likelihood  estimate  is  quite  slow  using  the  EM  algorithm,  taking 
approximately  2500  iterations  to  converge. 



Figure  6.  Fit  of  a  grouped  lognormal  distribution  to  blood 
lead  levels  in  black  children  aged  1-3  in  New  York  City  in 
1974. 


6  Discussion 

The  three  numerical  techniques  just  discussed  have  proved  to  be 
useful  in  distribution  fitting.  The  modified  Gauss-Newton 
method  as  described  here  is  useful  for  a  wide  range  of  distribu¬ 
tions,  including  truncated  distributions.  In  conjunction  with  the 
EM  algorithm,  it  will  handle  grouped  distributions.  For 
mixtures  of  distributions,  the  EM  algorithm  is  superior,  although 
it  can  converge  very  slowly. 

In  combination,  these  methods  can  estimate  parameters  from 
normal,  lognormal,  exponential,  gamma,  Weibull,  logistic, 
extreme  value,  and  beta  distributions,  as  well  as  mixtures  of 
normal,  lognormal,  and  exponential  distributions.  For  informa¬ 
tion  on  software  using  these  techniques  for  distribution  fitting 
contact the author.

Abstract

This paper presents some preliminary results on a new
method of density estimation that models an unknown
distribution as a mixture of normal components. The model is first
built using the recursive adaptive mixtures procedure of
Priebe. This preliminary model is allowed to come to
equilibrium with the data via a sequence of term annihilations based
on the Akaike Information Criterion in combination with
adjustment of the remaining model parameters via the
standard expectation maximization algorithm.


Introduction 

Given $X = \{x_1, x_2, \ldots, x_n\}$ where each $x_j$ is i.i.d.
according to an unknown density f(x), one is often
interested in estimating f(x). This problem occurs in such areas as
exploratory data analysis, classification, and regression.
There are a variety of approaches to the multivariate density
estimation problem [1].

An often used parametric approach is that of finite
mixture models [2] in combination with the expectation
maximization (EM) method of Dempster, Laird, and Rubin [3].
One difficulty with this tactic is that one needs some idea as
to the appropriate number of terms in the mixture model.
Given this information the EM algorithm is guaranteed to
converge to at least a local maximum in the likelihood
surface.

Some of the previous nonparametric approaches
include histograms [4], frequency polygons [5], adaptive
histograms [6], average shifted histograms [7], and kernel
estimators [8]. These approaches are beneficial in that they
possess nice asymptotic consistency properties and
robustness with regard to nonnormality. They are at a disadvantage
as compared to the mixture model approach when it is
suspected that the unknown true density is a mixture of a number
of components and one would like to estimate the a posteriori
probability of underlying component membership for an
unlabeled observation.

This  type  of  problem  exists  in  the  areas  of  medical 
diagnosis  and  image  processing.  In  medical  diagnosis  the 
component  membership  may  play  an  important  role  in  iden¬ 
tification  of  the  underlying  mechanism  of  disease  or  the  iden¬ 
tification  of  appropriate  tissue  type  in  an  image.  In  the  general 
problem  of  image  analysis  the  component  membership  may 
pertain  to  region  type. 

A recently developed density estimation technique
that circumvents some of the problems of the above
techniques is the adaptive mixtures procedure of Priebe and
Marchette [9]. This procedure is a blend of the finite mixtures and
kernel estimator approaches. It is essentially a mixtures type
approach that allows for the creation of new terms as
indicated by the data complexity. We have successfully applied
this technique in combination with fractal-based features to
the detection of man-made objects in land [10] and aerial [11]
images, the general problem of texture classification [12], and
the measurement of breast parenchymal tissue density [13].
The adaptive mixtures estimator is asymptotically consistent
like the kernel estimator, but it has the added benefit of
creating additional terms at a rate which is considerably less than
the rate, n, of term creation associated with the kernel estimator.

One drawback to the adaptive mixtures estimator is
that even though there is asymptotic L1 convergence for the
procedure there is no finite sample or asymptotic assurance
that the match between the complexity of the final model and
the data is optimal. Another way of saying this is that if the
underlying distribution is a finite mixture of M terms, one
would like M terms in the adaptive mixtures solution. The
goal of our work is to modify the adaptive mixtures procedure
so that it produces a model that matches the
unknown density not only in a functional sense, but also in terms of
model complexity.

Approach 

Given an unknown distribution $\alpha(x)$ we seek to
model the distribution using $\alpha^*(x)$ defined by

$$\alpha^*(x; \Psi) = \sum_{i=1}^{m} \pi_i K(x; \Gamma_i) \qquad (1)$$

where K is some fixed density parameterized by $\Gamma$, and $\Psi =
(\pi_1, \Gamma_1, \pi_2, \Gamma_2, \ldots, \pi_m, \Gamma_m)$. The $\pi_i$'s are referred to as the
mixing proportions. (We can assume for much of what follows
that K is taken to be the normal distribution, in which case $\Gamma_i$
becomes $\{\mu_i, \sigma_i\}$.) In the simplest case the mixture is
assumed to have a single term and the parameters that need to
be estimated are the mean and covariance of the distribution.
It is important to note that unlike finite mixture models the
number of terms m is not fixed but is driven by the data.

The basic stochastic approximation approach is to
recursively update the estimate $\Psi^*_{t+1}$ of the true parameters $\Psi_0$
based on the latest estimate $\Psi^*_t$ and the newest data point
$x_{t+1}$. That is,

$$\Psi^*_{t+1} = \Psi^*_t + \Phi_t(x_{t+1}; \Psi^*_t) \qquad (2)$$

for some update function $\Phi_t$. This approach is usually used
when it is known that the true distribution is a finite mixture
of components. However, one can certainly approach the
problem from the perspective of fitting the data to a finite
mixture model where one finds the $\Psi^*_{t+1}$ that produces the
best fit.

The specific form of the update equation that we
use is the one suggested by Titterington [14]. If we let $I(\Psi)$
be the Fisher information then the version of the recursive
update formula we will use is

$$\Psi^*_{t+1} = \Psi^*_t + \left(t\, I(\Psi^*_t)\right)^{-1} \frac{\partial}{\partial \Psi} \log \alpha^*(x_{t+1}; \Psi^*_t) \qquad (3)$$

where the derivative represents the vector of partial
derivatives with respect to the components of $\Psi$.

The AMDE stochastic approximation approach is
to recursively update $\Psi^*$, the estimate of the true parameters
$\Psi_0$, while at the same time providing the capability to
expand the extent of the parameter space $\Psi$ if dictated by
the underlying complexity of the data. We note that in the
AMDE case our parameter space $\Psi$ is given by
$(\pi_1, \theta_1, \pi_2, \theta_2, \ldots, \pi_n, \theta_n, \ldots)$. The procedure

$$\Psi^*_{t+1} = \Psi^*_t + A\, U_t(x_{t+1}; \Psi^*_t) + B\, Q(x_{t+1}; \Psi^*_t, t), \qquad (4)$$

is used to recursively update the density, where
$A = [1 - P_t(x_{t+1}; \Psi^*_t)]$ and $B = P_t(x_{t+1}; \Psi^*_t)$. $P_t$ represents a possibly
stochastic create decision and takes on values 0 or 1. $U_t$ updates
the current parameters using (3) while Q adds a new
component to the model. As is implicit in the equation, the decision
to add a new term is a function of the current data point, our
current estimation of the parameters, and time. The time
dependence is important in those cases where we wish to anneal
the probability of creation as a function of training time.

Previous work in the literature has examined the
application of the Akaike Information Criterion (AIC) [15] to
the determination of the number of components in a finite
mixture [16]. The AIC estimates the expected value of the
Kullback-Leibler information between the estimated model
and the unknown true density

$$\mathrm{AIC} = E[KL(\alpha, \hat\alpha)] = \int \alpha \log\frac{\alpha}{\hat\alpha}. \qquad (5)$$

AIC is defined in terms of likelihood, L, and the number of
free parameters in the model, M, as

$$\mathrm{AIC}(f) = -2 \ln(L(x)) + 2M. \qquad (6)$$

Using this idea as a starting point we have developed
a procedure that uses a single adaptive mixtures density
estimate, or a set of them, and produces a pruned model with a lower
complexity. This procedure uses AIC to evaluate the
appropriateness of lower complexity models that have been
subjected to the iterative EM method. In the iterative EM method
the update equation takes the form

$$\Psi^*_{t+1} = \Psi^*_t + \Phi(x; \Psi^*_t) \qquad (7)$$

where $\Phi$ is the update function and x is the set of
observations.

Our approach to the pruning process is as follows.
Given $\alpha^*_k$, an initial adaptive mixtures approximation to $\alpha$
containing k terms, the AIC of each of the k-1 term models is
computed after application of the EM method of equation 7
to each of the models. If AIC($\alpha^*_{k-1}$) < AIC($\alpha^*_k$) for one of the
k-1 term models then the pruning process is repeated using
this model. This process of pruning and expectation
maximization is repeated until no further improvement is possible. It
is important to point out that at each pruning step the
remaining terms' $\pi$'s are updated based on their Mahalanobis
distance to the pruned term prior to relaxation with the EM
method.
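The pruning loop can be sketched as follows for univariate normal mixtures. Several simplifications are assumed here and are not the authors' procedure: the starting model is a hand-picked four-term mixture rather than an adaptive mixtures estimate, the mixing proportions of the surviving terms are simply renormalized instead of being redistributed by Mahalanobis distance, the free-parameter count is taken as 3k - 1 for k terms, and a small floor on the standard deviations is imposed to avoid degenerate terms.

```python
import math
import random

def npdf(x, m, s):
    z = (x - m) / s
    return math.exp(-0.5 * z * z) / (s * math.sqrt(2.0 * math.pi))

def mix_loglik(data, comps):
    return sum(math.log(sum(p * npdf(x, m, s) for p, m, s in comps)) for x in data)

def em(data, comps, n_iter=50):
    """Relax a k-term normal mixture, given as a list of (p, mean, sd), by EM."""
    for _ in range(n_iter):
        resp = []
        for x in data:
            d = [p * npdf(x, m, s) for p, m, s in comps]
            tot = sum(d)
            resp.append([di / tot for di in d])
        new = []
        for j in range(len(comps)):
            nj = max(sum(r[j] for r in resp), 1e-12)
            m = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - m) ** 2 for r, x in zip(resp, data)) / nj
            new.append((nj / len(data), m, max(math.sqrt(var), 1e-3)))
        comps = new
    return comps

def aic(data, comps):
    # Assumed parameter count: k means, k sds, k - 1 free mixing proportions.
    return -2.0 * mix_loglik(data, comps) + 2.0 * (3 * len(comps) - 1)

def prune(data, comps):
    """Repeatedly drop the term whose removal (after EM) gives the lowest AIC."""
    best = aic(data, comps)
    while len(comps) > 1:
        candidates = []
        for j in range(len(comps)):
            kept = comps[:j] + comps[j + 1:]
            total = sum(p for p, _, _ in kept)
            kept = em(data, [(p / total, m, s) for p, m, s in kept])
            candidates.append((aic(data, kept), kept))
        cand_aic, cand = min(candidates, key=lambda c: c[0])
        if cand_aic >= best:
            break
        best, comps = cand_aic, cand
    return comps, best

# Simulated sample from case a: 0.5*N(-2,1) + 0.5*N(2,1) (sample size is arbitrary).
random.seed(3)
data = [random.gauss(-2.0, 1.0) for _ in range(150)]
data += [random.gauss(2.0, 1.0) for _ in range(150)]

start = em(data, [(0.25, -3.0, 1.0), (0.25, -1.0, 1.0), (0.25, 1.0, 1.0), (0.25, 3.0, 1.0)])
start_aic = aic(data, start)
pruned, final_aic = prune(data, start)
print(len(pruned), final_aic)
```

A term is removed only when the relaxed smaller model has a strictly lower AIC, so the final AIC can never exceed that of the starting model.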

Results 

This pruning approach was tested on data sets drawn
from two different bimodal two-term distributions and from
one four-mode four-term distribution (see Figure 1). In each
case 10,000 points were drawn from each distribution.
Twenty-five bootstrap resamples were extracted from each of
the data sets. A ten-term adaptive mixtures model was created
for each of the resampled data sets. Each of these models
was then subjected to the AIC based pruning process. This
process provides a model complexity distribution based on
the data set.

Figure 1 - Test densities.
Case a: $\alpha(x) = 0.5\,N(-2,1) + 0.5\,N(2,1)$
Case b: $\alpha(x) = 0.5\,N(-1.25,1) + 0.5\,N(1.25,1)$
Case c: $\alpha(x) = 0.25\,N(-6,1) + 0.25\,N(-2,1) + 0.25\,N(2,1) + 0.25\,N(6,1)$

Figures 2a and 2b - Adaptive mixtures estimates (dF plot and
density estimate) for two of the resamplings of the data set
drawn from case a.

In Figures 2a and 2b we present adaptive mixtures
solutions for two of the resamplings of the data set drawn
from $\alpha(x) = 0.5\,N(-2,1) + 0.5\,N(2,1)$. We have included dF space
plots along with the standard functional representation of the
distributions. dF space plots are an effective way to display
the terms in a mixture. Each term is plotted as a
circle whose radius is proportional to $\pi_i$ and whose center is
given by $(\mu_i, \sigma_i)$. We notice that the terms in each of the two
solutions are markedly different. This phenomenon falls under
the adage that there is "more than one way to skin a cat." We
also notice that there are more than the "theoretical" number
of terms needed. Each of the models is made up of ten terms.
The occurrence of a matching number of terms in each model
is the result of our initial constraint on the model
complexity.

Figure  3  illustrates  the  results  of  the  pruning  pro¬ 
cess.  For  each  of  the  three  distributional  types  we  have  plot¬ 
ted  a  histogram  of  the  number  of  terms  in  the  final  pruned 
models  for  each  of  the  twenty-five  resamples.  In  case  a  the 
procedure converged to the correct solution 11 of 25 times. In
case  b  the  procedure  converged  to  the  correct  solution  7  of  25 
times,  and  17  of  25  times  in  case  c.  In  all  cases  models  of 
more  appropriate  complexity  are  provided  by  the  procedure. 

The  last  thing  left  to  be  discussed  is  the  output  of  the 
pruning  procedure.  In  Figures  4  a,  b,  and  c  we  present  an 
expectation  maximized  adaptive  mixture  solution  along  with 
the  output  of  pruning  this  solution.  We  notice  that  the  number 
of  terms  in  the  solution  has  been  reduced  from  ten  to  the 
appropriate  number  in  each  case.  We  also  notice  that  the 
terms  left  from  the  process  are  in  approximately  the  correct 
location  and  have  about  the  right  mixing  coefficients  and 
variances. 

Summary 

We have presented some very preliminary results
concerning a new nonparametric density estimation
technique that combines the flexibility of term creation with the
parsimony of term annihilation. Much work needs to
be done to strengthen these initial anecdotal results.

Figure 3 - Histograms of the number of terms in
the final pruned models for each of the three test
cases.

Abstract

Terrell  [1989]  showed  that  the  probability  of  a  compact 
polyhedron  under  a  normal  distribution  could  be  obtained  via 
Parseval’s  Theorem  and  a  fast,  accurate  numerical  quadrature. 
Unfortunately,  the  efficiency  drops  with  the  size  of  the  re¬ 
gion,  so  it  is  not  a  particularly  good  way  to  obtain  tail  prob¬ 
abilities,  which  are  often  of  more  practical  interest.  However, 
an  analogous  technique  leads  to  efficient  trapezoidal  and 
Gaussian  quadratures  for  normal  tails,  which  actually  improve 
in  efficiency  as  the  region  moves  away  from  the  mean. 

1. Introduction

The  first  demand  that  statistics  makes  on  numerical  com¬ 
putation  is  for  probabilities  associated  with  a  normal  distribu¬ 
tion,  since  mainstream  mathematics  has  declared  that  these 
are  not  elementary  functions  in  the  sense  that,  for  example,  a 
tangent  is.  Good  algorithms  are  well-known  for  many  such 
problems;  but  by  no  means  all.  Terrell  [1989]  shows  that  one 
of  the  more  important  but  difficult  cases,  that  of  finding  the 
probability  of  a  polyhedron  in  a  normal  multivariate  problem, 
may  be  well-handled  by  a  class  of  techniques  called  Parseval 
quadratures:  The  probability  is  expressed  as  an  integral,  which 
is  then  transformed  to  an  integral  involving  the  Fourier  trans¬ 
forms  of  functions  in  the  original  integral,  by  an  application 
of  Parseval’s  Theorem.  The  new  integral  involves  smooth  func¬ 
tions  on  all  of  space,  so  that  trapezoidal  quadratures  converge 
rapidly.  Furthermore,  one  of  the  factors  is  a  normal  density,  so 
that  Gauss-Hermite  quadrature  works  very  well. 

Unfortunately, the method of Terrell [1989] applies only
to compact polyhedra, and its efficiency drops rapidly as the
maximum distance of the polyhedron from the mean increases.
Very  often  our  practical  concern  is  with  tail  probabilities,  ap¬ 
plied  to  regions  extending  arbitrarily  far  from  the  mean.  This 
paper  will  show  how  to  apply  a  variant  of  Parseval  quadrature 
to  certain  of  these  probability  calculations,  in  such  a  way  that 
their  efficiency  actually  increases  with  the  minimum  distance 
of  the  region  from  the  mean.  The  paper  derives  the  technique, 
and  then  gives  examples. 

2.  An  Application  of  Parseval’s  Theorem 

By a tail probability, we mean in the univariate case, for Z
standard normal,

$$P(Z > z) = Q(z) = \int_z^\infty \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, dx,$$

where z > 0. Our trick will be first to transform to positive real
support by T = Z - z. Then

$$Q(z) = e^{-z^2/2} \int_0^\infty \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}\, e^{-zt}\, dt.$$

Doubling the integral by reflection about the origin, and
expressing the integrand as the product of two densities, we get

$$Q(z) = \frac{e^{-z^2/2}}{z} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2} \cdot \frac{z}{2}\, e^{-z|t|}\, dt.$$

Parseval's Theorem says that if f and g are densities, and
$\phi$ and $\psi$ their Fourier transforms (characteristic functions of
the associated random variables), then
$\int f g = \frac{1}{2\pi} \int \phi\, \bar{\psi}$. This
is easy to remember if you think of the Fourier transform as a
rotation through a right angle; therefore, it leaves the inner
product of two vectors unchanged. Then

$$Q(z) = \frac{e^{-z^2/2}}{2\pi z} \int_{-\infty}^{\infty} e^{-s^2/2}\, \frac{1}{1 + (s/z)^2}\, ds,$$

where I have used the familiar characteristic functions for the
normal and Laplace families. We may then compute the
integral by the trapezoidal rule, which will be seen to work
well because the integrand is smooth and evaluation points
far from zero quickly make negligible contribution.
Alternatively, Gauss-Hermite quadrature will turn out to work well,
because one of the factors under the integrand is a Gaussian
density; and the other factor is smooth and very cheap to
compute.

Before we look at these techniques, notice that the Cauchy
factor under the integral sign is usually close to one for z large,
which is the case of most interest when calculating tail
probabilities. Therefore by subtraction we may rewrite

$$Q(z) = \frac{e^{-z^2/2}}{z\sqrt{2\pi}} \left[ 1 - \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-s^2/2}\, \frac{(s/z)^2}{1 + (s/z)^2}\, ds \right].$$

Quadrature in this form will often be convenient, because we
are now making a small correction to a classic upper bound
for the tail probability.

3.  Trapezoidal  Quadrature 

A trapezoidal rule quadrature formula for the real line is
$\int_{-\infty}^{\infty} f(x)\,dx \approx h \sum_{i=-\infty}^{\infty} f(a + ih)$, where a is a starting point and
h is the spacing between the quadrature points. We expect the
accuracy to increase for small h; but of course the
computational burden increases too. We apply this to the integral in
the previous expression, with a = 0:

$$Q(z) \approx \frac{e^{-z^2/2}}{2\pi z}\, h \sum_{k=-\infty}^{\infty} \frac{e^{-(kh)^2/2}}{1 + (kh/z)^2}.$$

h \ z   1.0                 2.0                 3.0                  4.0
0.8     .1590               .02375098           .001349898090        3.1671241858e-5
0.57    .15867              .0227501322         .0013498980316344    3.167124183312e-5
0.4     .15865540           .02275013194820     .0013498980316301
0.28    .158655254          .0227501319481792
0.2     .15865525393148
0.14    .1586552539314571

Trapezoidal Quadrature of Q(z)


The last entry in each column is accurate to the stated
precision. About 9/h quadrature points were required to achieve it.

The error for the trapezoidal rule on the line is given by
the Poisson summation formula; in the case of Parseval
quadrature, that becomes, in our notation,

$$\text{error} = \sum_{k \ne 0} \frac{1}{2\pi} \int \phi(s)\, \bar{\psi}(s)\, e^{-2\pi i k s / h}\, ds$$

for zero a quadrature point. Notice that we have written it as a
sum of inverse Fourier transforms of a product. These are
convolutions of the original functions, so that

Theorem: For a trapezoidal Parseval quadrature of
$\int f g$ with zero a knot and diameter h, the error is

$$\sum_{k \ne 0} \int f(t)\, g\!\left(t - \frac{2\pi k}{h}\right) dt.$$

In our application this becomes

$$\frac{e^{-z^2/2}}{z} \sum_{k \ne 0} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}\, \frac{z}{2}\, e^{-z|t - 2\pi k/h|}\, dt.$$

We notice that this
quadrature always overestimates the answer. By contrast, the
other symmetric mesh, in which zero is a midpoint, has error
the same expression with alternating signs. In particular, the
first term is negative. Therefore, in practical cases this other
estimate has almost the same size error with opposite signs.

To estimate our error, use the obvious inequality
$x \le |x|$ in the second factor of the integrand. After an
integration we get

$$\text{error} \le \sum_{k=1}^{\infty} e^{-2\pi k z / h} = \frac{1}{e^{2\pi z/h} - 1}.$$

For example, letting h = 0.8 and z = 1 we estimate an error .000388;
the actual error from our table is .000245. More accurate
quadratures give tighter error bounds.
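The trapezoidal Parseval quadrature and its error bound are easy to try numerically. The sketch below assumes the form of Q(z) as an integral of a Gaussian factor times a Cauchy factor, with zero as a knot; the function names and the truncation point `cutoff` are our own choices, and the reference value comes from the complementary error function in the standard library.

```python
import math

def tail_exact(z):
    """Reference value of Q(z) via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def tail_trapezoid(z, h, cutoff=12.0):
    """Trapezoidal rule with zero as a knot, applied to the Parseval form of Q(z)."""
    n = int(cutoff / h)
    total = 1.0  # the k = 0 term
    for k in range(1, n + 1):
        s = k * h
        # Gaussian factor times Cauchy factor; doubled to cover -s by symmetry.
        total += 2.0 * math.exp(-0.5 * s * s) / (1.0 + (s / z) ** 2)
    return math.exp(-0.5 * z * z) / (2.0 * math.pi * z) * h * total

def error_bound(z, h):
    """Geometric-series bound 1 / (exp(2*pi*z/h) - 1) on the quadrature error."""
    return 1.0 / math.expm1(2.0 * math.pi * z / h)

coarse_error = tail_trapezoid(1.0, 0.8) - tail_exact(1.0)
fine_error = tail_trapezoid(1.0, 0.14) - tail_exact(1.0)
print(coarse_error, error_bound(1.0, 0.8), fine_error)
```

With h = 0.8 and z = 1 the computed error is positive (an overestimate) and sits just under the bound; with h = 0.14 the result agrees with the reference value to roughly machine precision.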


4.  Gauss-Hermite  Quadrature 

Gauss-type quadratures approximate integrals of the form
$\int f w$, where w is a positive weight function of finite integral,
by a sum of the form $\sum_{i=1}^{n} w_i f(x_i)$, where the constant weights
$w_i$ and quadrature points $x_i$ are chosen simultaneously so that
the quadrature is exact on all polynomials of degree up to $2n - 1$. In the
integral of section 2, the obvious choice of weight is a normal
density; in that case the quadrature is called Gauss-Hermite
quadrature. This is because the points $x_i$ are the roots of the Hermite
polynomial of degree n. These are orthogonal with respect to
the normal weight function. Then

$$f(x) = \frac{1}{1 + (x/z)^2};$$

the cost of each evaluation of this function is amazingly small, and by
symmetry we need only half of them.

The constants were obtained from Abramowitz and Stegun
[1972, p. 924], and the following examples evaluated:

n \ z    1.0        2.0          3.0            4.0
5        .1675      .022813      .00135020      3.167212e-5
10       .15717     .02274823    .0013498962    3.167124069e-5
exact    .158655    .02275013    .00134989803   3.1671241833e-5

These examples do not exhibit the very high degrees of
accuracy in the earlier table; but you must remember that the first
row corresponds to only two evaluations of f, and the second
to only five!
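The first row of the table can be reproduced with the five-point rule, whose nodes and weights have a closed form for the standard normal weight function. The sketch below uses those closed-form constants rather than the tabulated ones from Abramowitz and Stegun, and the function name is ours.

```python
import math

def tail_gauss_hermite5(z):
    """Five-point Gauss-Hermite approximation to Q(z), weight = standard normal density."""
    r = math.sqrt(10.0)
    # Roots of He5(x) = x^5 - 10x^3 + 15x and the matching weights (summing to one).
    nodes = [0.0, math.sqrt(5.0 - r), math.sqrt(5.0 + r)]
    weights = [8.0 / 15.0, (7.0 + 2.0 * r) / 60.0, (7.0 - 2.0 * r) / 60.0]
    total = weights[0]  # f(0) = 1, so the centre node needs no evaluation
    for x, w in zip(nodes[1:], weights[1:]):
        # Two evaluations of f(x) = 1 / (1 + (x/z)^2), each used for +x and -x.
        total += 2.0 * w / (1.0 + (x / z) ** 2)
    return math.exp(-0.5 * z * z) / (z * math.sqrt(2.0 * math.pi)) * total

print(tail_gauss_hermite5(1.0), tail_gauss_hermite5(2.0))
```

Only two evaluations of f are needed, matching the remark in the text, and the values agree with the n = 5 row of the table at z = 1 and z = 2.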

There exists an error estimate for Gauss-type quadratures,
due to Markoff (see e.g. Davis [1975, p. 344]); but it seems to
be enormously larger than the observed error in all examples
tested, and so is of no practical value.


5.  Multivariate  Normal  Tails  by  Parseval’s 
Theorem 

There are well-known, rapid methods for finding
univariate normal tail probabilities; but not so for the
multivariate case. We will let a multinormal tail region be defined
as follows: let $X \sim N(0, \Sigma)$. Then by X > z we will mean that
the random variable exceeds the fixed vector z coordinate by
coordinate. Then

$$P(X > z) = \int_{x > z} (2\pi)^{-n/2} |\Sigma|^{-1/2}\, e^{-x' \Sigma^{-1} x / 2}\, dx.$$

We move the corner of our orthant to the origin by the change
of variables Y = X - z, and expand the exponent to get

$$P(X > z) = e^{-z' \Sigma^{-1} z / 2} \int_{y > 0} (2\pi)^{-n/2} |\Sigma|^{-1/2}\, e^{-y' \Sigma^{-1} y / 2}\, e^{-z' \Sigma^{-1} y}\, dy.$$

As before, our tactic will be to expand the integrand by
reflection to cover Euclidean n-space; then the second factor will
be a product of Laplace densities. This only works when we
impose a condition on the orthant corner z:

Defn: For a multinormal $X \sim N(0, \Sigma)$ random vector,
X > z will be called a tail orthant whenever $\Sigma^{-1} z > 0$.

This requires that the marginal density along each edge
of the region be everywhere decreasing. In this case,


Now  apply  the  multivariate  Parseval’s  theorem  to  get 


WY 


P(X>z)= 


which has a smooth integrand, well suited to quadrature. The
computations become much cheaper if we transform to a
spherically-symmetric normal distribution:


P(X>z)  = 


I* 


I 

dt  . 


We may compute this by a trapezoidal quadrature on a
square mesh of width h. The error analysis parallels the
univariate case. For example, let (X, Y) be standard normal
variables with correlation 0.5; we compute $P(X > z, Y > z)$
for various values of z:


h \ z   1.0                  2.0                  3.0                   4.0
0.57    .05215               .003611099           7.57171105e-5         4.6070648017e-7
0.4     .052294              .00361111817         7.5717111563e-5       4.607064802006e-7
0.28    .05230034            .00361111821034      7.5717111562985e-5
0.2     .0523004066          .0036111182103472
0.14    .052300406764267
0.1     .0523004067642694

Trapezoidal Quadrature of P(X>z, Y>z)


Here the starting point was not the origin, but (h/2, h/2);
therefore the estimates are on the low side. Approximately the square
of the number of function evaluations were required here
compared to the univariate case; but a Mac Quadra 800 took less
than a second to do the longest of these computations.

Gauss-Hermite quadrature may be productively applied
here; the evaluation points and weights are just the tensor
product of those in the univariate case.


6.  Bibliography 

Abramowitz, M., and Stegun, I. A. (1972). Handbook of Mathematical
Functions. Dover: New York.

Davis, P. J. (1975). Interpolation and Approximation. Dover: New York.

Terrell, G. R. (1989). Parseval quadrature for computing
multinormal probabilities. Technical Report 89-2, Department
of Statistics, Virginia Polytechnic Institute and State
University.




COMPUTATION OF MULTIVARIATE NORMAL PROBABILITIES
OVER CONVEX REGIONS

Paul  N.  Somerville 
University  of  Central  Florida 

Morgan  C.  Wang 
University  of  Central  Florida 


ABSTRACT 

A method is presented for numerical evaluation of a
multivariate normal integral over any convex region.
The method is efficient for interactive analyses for
moderate accuracies (e.g. approximately two decimal
places). The method has been applied to the
computation of critical values for multiple comparison
methods in the general linear model.

1.  INTRODUCTION 

Let $x = (x_1, x_2, \ldots, x_k)'$ have the multivariate normal
distribution $f(x) = \mathrm{MVN}(\mu, \sigma^2 \Sigma)$, where $\Sigma$ is the
correlation matrix of $x$ and $\sigma$ is a scalar. There are
many problems in statistics that require computation
of the integral of $f(x)$ over some region $R$, that is,

$$\int_R f(x)\, dx.$$

For  the  case  when  the  region  of  integration  is 
rectangular,  the  problem  has  been  addressed  by  many 
authors.  They  include  Gupta  (1964),  Milton  (1972), 
Schervish  (1984),  Deak  (1986),  Olson  and  Weissfeld 
(1991),  Genz  (1992),  Drezner  (1992),  and  Kennedy 
and  Wang  (1991,  1992).  However,  regions  of 
integration  for  statistical  applications  such  as  multiple 
comparison  procedures  are  not  rectangular.  For 
example, the critical value q for (1 - α) simultaneous
confidence intervals for Tukey's (1953) pairwise
differences of population means can be obtained by
solving the following probability equation for q:

$$\mathrm{Prob}\{\, |(y_i - y_j) - (\mu_i - \mu_j)| \le q \sqrt{v_{ij}} \ \text{for all } i \ne j \,\} = 1 - \alpha,$$

where $y_i$ is the least squares estimate of $\mu_i$ and $v_{ij}$ is
the MVUE for the variance of $y_i - y_j$. The region of
integration  is  convex  and  bounded  by  k(k  -  1) 
hyperplanes.  In  this  paper,  we  describe  a  method  for 
computation  of  the  multivariate  normal  integral  over 
any  convex  region.  The  region  of  integration  is 
essentially  reduced  to  a  single  dimension  and 
integration  is  accomplished  with  the  assistance  of 
Monte  Carlo  methods. 


2.  METHODOLOGY 

Although  the  method  is  valid  for  any  convex  region, 
we  shall  assume  the  region  of  integration  includes  the 
origin.  With  no  loss  of  generality,  we  assume  the 
mean  is  at  the  origin.  Further,  in  our  development,  we 
shall assume the region R is bounded by m (> 1)
hyperplanes and is described by

$$L x \le d,$$

where $L' = (l_1, l_2, \ldots, l_m)$ and the $j$th hyperplane is
given by $l_j' x \le d_j$. Our first step is to make a
transformation so that the new variables $w_1, w_2, \ldots, w_k$
are NID(0, 1). Let $\Sigma = T T'$ (Cholesky decomposition),
and set $x = T w$. The region R becomes

$$G w \le d,$$

where $G = L T$. Setting $G' = (g_1, g_2, \ldots, g_m)$, the $j$th
hyperplane becomes $g_j' w = d_j$.

We  discuss  two  cases. 

Case 1: $\Sigma$, $\sigma^2$ known.

Case 2: $\Sigma$ known, $\sigma^2$ unknown; $s^2$ is an unbiased
estimate of $\sigma^2$ with $\nu$ degrees of freedom.

Our  strategy  is  as  follows. 

a) Choose a unit random direction $c = (c_1, c_2, \ldots, c_k)'$.

b) Obtain distance r to the boundary in the direction c.

Case 1.

c) Since the $w_i$ are NID(0, 1), $w'w$ has a $\chi^2$ distribution
with k degrees of freedom.

d) $\mathrm{Prob}[\chi^2_k < r^2]$ is an unbiased estimate of
the integral value.

Case 2.

c) $(w'w/k)/s^2$ has an $F_{k,\nu}$ distribution.

d) $\mathrm{Prob}[F_{k,\nu} < (r^2/k)/s^2]$ is an unbiased
estimate of the integral value.

e) Repeat steps a) to d) until the average of
the estimates has a specified standard error.

3. DISTANCE TO THE BOUNDARY

For a given direction c, the distance from the origin to
the plane j is $d_j/(g_j'c) = d_j/a_j = r_j$ (say). The
intersection of the line with the plane j is $w = [d_j/(g_j'c)]\,c = r_j c$.
The above intersection will be in the region R if

$$g_i'w = g_i' r_j c = r_j a_i \le d_i, \quad i = 1, 2, \ldots, m.$$

The distance to the boundary in the direction c is the
distance to the plane j for which $r_j a_i \le d_i$ for all $i \ne j$.
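
The boundary-distance rule can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code. It assumes the origin lies strictly inside the region (every $d_j > 0$), in which case the plane satisfying the condition above is simply the one with the smallest positive $r_j$. The function name `distance_to_boundary` and the use of `math.inf` for a direction that never leaves the region are illustrative choices.

```python
import math

def distance_to_boundary(G, d, c):
    """Distance from the origin to the boundary of {w : G w <= d} along c.

    G is a list of m rows (the vectors g_j), d a list of m positive
    offsets, and c a direction vector.  Returns math.inf when the ray
    from the origin in direction c never leaves the region.
    """
    r = math.inf
    for g_j, d_j in zip(G, d):
        a_j = sum(g * ci for g, ci in zip(g_j, c))  # a_j = g_j' c
        if a_j > 0.0:                # the ray moves toward plane j
            r = min(r, d_j / a_j)    # r_j = d_j / a_j
    return r

# Rectangle containing the origin: -1 <= w1 <= 2, -1 <= w2 <= 3.
G = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
d = [2.0, 1.0, 3.0, 1.0]
print(distance_to_boundary(G, d, [1.0, 0.0]))   # reaches the plane w1 = 2
print(distance_to_boundary(G, d, [0.0, -1.0]))  # reaches the plane w2 = -1
```

Planes with $a_j \le 0$ are skipped because the ray moves away from (or parallel to) them, so they can never be the first plane crossed.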

4.  ALGORITHM 

1. Input data: Σ, σ² or s² and ν, L, d, m, k, ε, NMAX
and SEED.

2. Obtain the matrix G = L T, where T is the lower
triangular matrix of the Cholesky decomposition for
Σ or σ²Σ.

3. Set SUM = SUMSQ = 0, N = 0, STD = 0.

Do while (N = 0 or STD > ε) and N < NMAX
   Set TSUM = 0 and TSUMSQ = 0
   Repeat (a) to (c) 10k² times

   a) Generate a unit random direction c.

   b) If for some j, r_j a_i ≤ d_i for all i = 1, ..., m,
      where a_j = g_j'c, r_j = d_j/a_j,
      then r = r_j, set tt = Prob[χ²_k < r²] or
      tt = Prob[F_{k,ν} < (r²/k)/s²];
      else set tt = 1.

   c) TSUM = TSUM + tt,
      TSUMSQ = TSUMSQ + tt*tt.

   d) N = N + 1, SUM = SUM + TSUM,
      SUMSQ = SUMSQ + TSUMSQ,
      MVN = SUM / (10Nk²),
      STD = ((SUMSQ - SUM*MVN) / (10Nk² (10Nk² - 1)))^(1/2).

4. Output is MVN, STD, and 10Nk².
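
As a rough illustration of the algorithm, the following Python sketch covers Case 1 only (Σ and σ² known, with σ² absorbed into the matrix passed as `S`). It is a sketch under stated assumptions rather than the authors' program: the names `chi2_cdf`, `cholesky` and `mvn_convex` are hypothetical, the chi-square probability is computed from the standard series for the regularized incomplete gamma function, the origin is assumed to lie strictly inside the region, and the loop stops once the standard error falls to `eps` or `nmax` batches have been used. Case 2 would replace the chi-square probability with an F probability.

```python
import math
import random

def chi2_cdf(x, k):
    """P(chi-square with k d.f. < x) via the incomplete-gamma series."""
    if x <= 0.0:
        return 0.0
    a, t = 0.5 * k, 0.5 * x
    if t > a + 40.0 * math.sqrt(a) + 40.0:
        return 1.0                      # far in the upper tail
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-17 * total:
        term *= t / (a + n)
        total += term
        n += 1
    return min(1.0, total * math.exp(a * math.log(t) - t - math.lgamma(a)))

def cholesky(S):
    """Lower-triangular T with T T' = S (S symmetric positive definite)."""
    k = len(S)
    T = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1):
            s = S[i][j] - sum(T[i][p] * T[j][p] for p in range(j))
            T[i][j] = math.sqrt(s) if i == j else s / T[j][j]
    return T

def mvn_convex(L, d, S, eps=0.002, nmax=200, seed=1):
    """Estimate P(L x <= d) for x ~ MVN(0, S).

    Returns (MVN, STD, number of random directions used).
    """
    rng = random.Random(seed)
    k, T = len(S), cholesky(S)
    # G = L T, so the region becomes G w <= d with w ~ NID(0, 1).
    G = [[sum(row[p] * T[p][q] for p in range(k)) for q in range(k)]
         for row in L]
    batch = 10 * k * k
    total = total_sq = 0.0
    n_dirs, mvn, std = 0, 0.0, math.inf
    for _ in range(nmax):
        for _ in range(batch):
            w = [rng.gauss(0.0, 1.0) for _ in range(k)]
            norm = math.sqrt(sum(v * v for v in w))
            c = [v / norm for v in w]          # uniform unit direction
            r = math.inf
            for g_j, d_j in zip(G, d):
                a_j = sum(g * ci for g, ci in zip(g_j, c))
                if a_j > 0.0:
                    r = min(r, d_j / a_j)
            tt = 1.0 if math.isinf(r) else chi2_cdf(r * r, k)
            total += tt
            total_sq += tt * tt
        n_dirs += batch
        mvn = total / n_dirs
        std = math.sqrt(max(0.0, total_sq - total * mvn)
                        / (n_dirs * (n_dirs - 1)))
        if std <= eps:
            break
    return mvn, std, n_dirs

# P(x1 <= 2, x2 <= 2) for independent standard normals.
identity = [[1.0, 0.0], [0.0, 1.0]]
print(mvn_convex(identity, [2.0, 2.0], identity))
```

For the independent case the printed estimate can be compared with the product of two univariate normal probabilities.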


5.  EXPERIMENTAL  RESULTS 

The  following  integral  was  calculated  for  40  different 
sets  of  parameters. 

$$\int \cdots \int \mathrm{MVN}(0, \Sigma)\, dx$$

Σ was a correlation matrix with equal off-diagonal
elements ρ, and the limits of integration for each
variable were -∞ to a. Values for ρ were 0 (.1) .9 and
four different values of a were randomly chosen from
the interval (1.5, 2.5). Sufficient random directions
were generated to obtain a standard error of .002 for
the calculated value of the integral. The number of
random directions required was approximately 100 k².
The following table gives the average absolute error
and the standard deviation of the average absolute
error for various values of k.


 k     average           sd of average
       absolute error    absolute error

 3     .0016             .0014
 4     .0018             .0015
 5     .0013             .0011
 6     .0022             .0019
 7     .0016             .0013
 8     .0018             .0014
 9     .0019             .0015
10     .0021             .0018
12     .0011             .0007
14     .0016             .0014
16     .0013             .0010
20     .0013             .0011

6.  CONCLUSIONS 

A method is presented for the evaluation of a
multivariate normal integral over any convex region.
The method is efficient for moderate accuracies (e.g.
approximately two decimal places). Quoting from
Berger (1991), "...for statistical problems ... two
significant digit accuracy typically suffices, and only
rarely are more than three ... needed." The method is
thus practical for a wide variety of statistical problems.
A typical application is to problems in multiple
comparisons. Applications of the methods presented
here are given in Somerville (1993a,b, 1994).
Computation  times  are  such  that  interactive  statistical 
analyses  are  practical  on  80386  or  80486  processors. 

Future  research  on  more  sophisticated  sampling  and 
computational  methods  should  significantly  decrease 
processing  times.  The  method  is  well  adapted  to  the 
use  of  parallel  processors. 

Abstract

There is no program for order restricted statistical
inference in any major statistical software package,
owing to the computational hurdle. As a first step, we
develop a set of efficient programs for simulating p-values
of chi-bar square statistics and level probabilities,
covering complete order, matrix order, cubic order, tree
order and general partial order. We discuss the efficiency
and accuracy of these programs and compare two methods of
simulating p-values for chi-bar square statistics:
(1) a direct method and (2) computing p-values from Monte
Carlo estimates of the level probabilities. We give an
example from a clinical trial to illustrate the application
of these programs.

1  Introduction 

The $\bar\chi^2$ distributions are the most important
kind of distributions in order restricted statistical
inference. The book "Order Restricted Statistical
Inference" by Robertson, Wright and Dykstra (1988) studies
$\bar\chi^2$ distributions in Chapters two and three.
Because of the variety and complexity of partial orders,
it is difficult to study $\bar\chi^2$ distributions for
general partial orders from their point of view. Therefore,
the book only deals with several simple cases, such as
a complete order or a simple tree order on an index set
of small size. This situation has made applications of
order restricted inference very restrictive, and none of
the major statistical software packages contains
$\bar\chi^2$ distributions or order restricted inference.
Because of the development of efficient algorithms for
isotonic regression and of modern computers, we are able
to study $\bar\chi^2$ distributions from an application
point of view with computer intensive methods. In this
article, we provide two methods, the RF (relative
frequency) method and the LP (level probability) method,
to obtain p-values of $\bar\chi^2$ distributions, and
compare their accuracy.

For simplicity, we first introduce isotonic
regressions on a two dimensional grid. Let
$X = \{(i,j) : i = 1, \ldots, I;\ j = 1, \ldots, J\}$ be an
I × J grid. A function $f(\cdot,\cdot)$ on X is said to
be isotonic if $f(\cdot,\cdot)$ is increasing in both
variables. A function $g^*(\cdot,\cdot)$ on X is said to be an
isotonic regression of a given function $g(\cdot,\cdot)$
with known weights $w(\cdot,\cdot)$ on X, if $g^*(\cdot,\cdot)$ is
a solution of the following minimization problem:

$$\min_f \sum_{i,j} [g(i,j) - f(i,j)]^2\, w(i,j)$$

subject to $f(\cdot,\cdot)$ is isotonic on X.

Let $g(i,j)$, $i = 1, \ldots, I$, $j = 1, \ldots, J$, be $n = I \times J$
independent normal random variables with mean 0 and
variance $1/w(i,j)$, $\bar g$ be their weighted sample mean,
and $g^*(\cdot,\cdot)$ be the isotonic regression of
$g(\cdot,\cdot)$ with weights $w(\cdot,\cdot)$. Then

$$\bar\chi^2_{01} = \sum_{i,j} [g^*(i,j) - \bar g]^2\, w(i,j)$$

is called a $\bar\chi^2_{01}$ random variable, and

$$\bar\chi^2_{12} = \sum_{i,j} [g(i,j) - g^*(i,j)]^2\, w(i,j)$$

is called a $\bar\chi^2_{12}$ random variable. Let M be the number
of distinct values of the isotonic regression $g^*(\cdot,\cdot)$;
then M is a random variable taking values 1, ..., n. The
probability distribution of M is called the level
probabilities. $\bar\chi^2$ distributions are mixtures of
$\chi^2$ distributions. Robertson et al. (1988) show that

$$P(\bar\chi^2_{01} > c) = \sum_{k=2}^{n} P(M = k)\, P(\chi^2_{k-1} > c); \qquad (1)$$

$$P(\bar\chi^2_{12} > c) = \sum_{k=1}^{n-1} P(M = k)\, P(\chi^2_{n-k} > c). \qquad (2)$$

The definitions of isotonic regression, $\bar\chi^2$
distributions and level probabilities can be
generalized to a partially ordered finite index
set X.

2  An  Example 

Cornfield (1962) provides a data set from a
clinical trial. It is also listed in Agresti (1990).
This is a sample of male residents of Framingham,
Massachusetts, aged 40-59, classified both by blood
pressure and serum cholesterol level. In this data set,
the response variable is a binary variable with 1
for occurrence of heart disease and 0 for non-occurrence.
By medical theory, the rate of presenting heart disease
increases when blood pressure (x) or serum cholesterol (y)
increases. The covariate x is divided into 8 ordered
levels, and the covariate y is divided into 7
ordered levels. Therefore, we have 56 cells
in total. The cell sample proportions can
be viewed as a function $g(\cdot,\cdot)$ on an 8 × 7
grid. The ordered maximum likelihood estimate
function $g^*(\cdot,\cdot)$ of the rate of presenting
heart disease, which is increasing both in
blood pressure and serum cholesterol, is an
isotonic regression of $g(\cdot,\cdot)$ with given weights
$w(\cdot,\cdot)$, where $w(i,j)$ is the number of observations
in the (i, j) cell. The ordered maximum
likelihood estimates are listed in Table 1.

In order to verify the statement that the
rate of presenting heart disease increases
when blood pressure or serum cholesterol increases,
we consider the following models for the
data set.

$M_0$: constant rate model, that is, the rate
of heart disease depends on neither blood
pressure nor serum cholesterol.

$M_1$: isotonic rate model, that is, the rate
of heart disease is increasing in both blood
pressure and serum cholesterol.

$M_2$: saturated model, that is, the rate of
heart disease is arbitrary.

Let $T_{01}$ be a likelihood ratio test statistic
for testing the independence model $M_0$ vs the
isotonic model $M_1 - M_0$. Let $T_{12}$ be a likelihood
ratio test statistic for testing $M_1$ vs the
saturated model $M_2 - M_1$. Computational formulas
for $T_{01}$ and $T_{12}$ can be found in Robertson
et al. (1988). For the Cornfield data set,
$T_{01} = 70.66$ and $T_{12} = 50.50$. Robertson and
Wegman (1978) have shown that the asymptotic
distributions of the likelihood ratio test
statistics $T_{01}$ and $T_{12}$ are $\bar\chi^2_{01}$ and
$\bar\chi^2_{12}$ distributions respectively.

3  P-values 

There are two ways to obtain p-values of $\bar\chi^2$
distributions by computer intensive methods.
The first one is a direct method. We generate
a random sample of size N from a $\bar\chi^2$
distribution, count the frequency f of values that

Table 1: Ordered Maximum Likelihood Estimates for the Cornfield Data
(rows: blood pressure; columns: serum cholesterol)

           -200     200-209  210-219  220-244  245-259  260-284  284-
-117       0.0106   0.0106   0.0106   0.0106   0.0106   0.0303   0.0303
117-126    0.0106   0.0250   0.0364   0.0478   0.0478   0.0990   0.0990
127-136    0.0242   0.0250   0.0364   0.0478   0.0478   0.0990   0.0990
137-146    0.0242   0.0250   0.0364   0.0741   0.0943   0.0990   0.1618
147-156    0.0364   0.0364   0.0364   0.0845   0.0943   0.1618   0.1618
157-166    0.0364   0.0364   0.0364   0.0845   0.0943   0.1618   0.2679
167-186    0.0811   0.0811   0.0811   0.0845   0.2500   0.2679   0.2679
186-       0.1667   0.1667   0.2500   0.2500   0.2500   0.2679   0.2679

exceed the specified test statistic c; then the
relative frequency $\hat p = f/N$ is an unbiased
estimate of $p = P(\bar\chi^2 > c)$. We call this
method the RF (relative frequency) method. In the
RF method, $N\hat p$ is a binomial random variable
with mean $Np$ and variance $Np(1 - p)$.
Thus, $\hat p$ has mean p and variance $p(1 - p)/N$.
The second method uses simulation to obtain
estimates of the level probabilities, then applies
formula (1) or (2) to get an estimate $\tilde p$
of p. We call this method the LP (level probability)
method. Let $\hat\pi_k$ be an unbiased estimate
of $P(M = k)$, for $k = 1, \ldots, n$. Then an LP
estimate $\tilde p$ of p can be obtained by the following
formulas.

$$\tilde p = \sum_{k=2}^{n} \hat\pi_k\, P(\chi^2_{k-1} > c), \qquad (3)$$

$$\text{or}\quad \tilde p = \sum_{k=1}^{n-1} \hat\pi_k\, P(\chi^2_{n-k} > c). \qquad (4)$$

The LP estimate $\tilde p$ is also an unbiased, consistent
estimate of p. Applying results in section
12.1.5 of Agresti (1990), we obtain the following
theorem.

Theorem 1 (1). $E(\tilde p) = p$.

(2a). For a $\bar\chi^2_{01}$ random variable,
$\mathrm{Var}(\tilde p) = [\sum_{k=2}^{n} P(\chi^2_{k-1} > c)^2 P(M = k) - p^2]/N$;

(2b). For a $\bar\chi^2_{12}$ random variable,
$\mathrm{Var}(\tilde p) = [\sum_{k=1}^{n-1} P(\chi^2_{n-k} > c)^2 P(M = k) - p^2]/N$;

(3). $\mathrm{Var}(\tilde p) < \mathrm{Var}(\hat p)$.

Thus, we can estimate the variance of $\tilde p$ by the
following formulas.

$$\widehat{\mathrm{Var}}(\tilde p) = \Big[\sum_{k=2}^{n} P(\chi^2_{k-1} > c)^2\, \hat\pi_k - \tilde p^2\Big]/N; \qquad (5)$$

$$\widehat{\mathrm{Var}}(\tilde p) = \Big[\sum_{k=1}^{n-1} P(\chi^2_{n-k} > c)^2\, \hat\pi_k - \tilde p^2\Big]/N. \qquad (6)$$
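
The two estimators can be compared on the simplest case, a complete order with equal weights. The Python sketch below is an illustration under stated assumptions and is not one of the programs described in this paper: it uses the pool-adjacent-violators algorithm rather than the algorithms of Eddy and Qian, handles only the $\bar\chi^2_{01}$ statistic, and computes chi-square tail probabilities from the standard incomplete-gamma series. The names `chi2_sf`, `pava` and `p_values_01` are hypothetical.

```python
import math
import random

def chi2_sf(x, df):
    """P(chi-square with df d.f. > x) from the incomplete-gamma series."""
    if df == 0:
        return 0.0          # chi-square with 0 d.f. is a point mass at 0
    if x <= 0.0:
        return 1.0
    a, t = 0.5 * df, 0.5 * x
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-17 * total:
        term *= t / (a + n)
        total += term
        n += 1
    lower = total * math.exp(a * math.log(t) - t - math.lgamma(a))
    return max(0.0, 1.0 - lower)

def pava(y):
    """Equal-weight isotonic regression for the complete order.

    Returns the fitted values and the number of level sets M.
    """
    blocks = []                       # each block is [sum, count]
    for v in y:
        blocks.append([v, 1])
        # Pool while the previous block mean is not below the current one.
        while len(blocks) > 1 and \
                blocks[-2][0] * blocks[-1][1] >= blocks[-1][0] * blocks[-2][1]:
            s, cnt = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += cnt
    fit = [s / cnt for s, cnt in blocks for _ in range(cnt)]
    return fit, len(blocks)

def p_values_01(n, c, n_sim, seed=0):
    """Return (RF estimate, its sd, LP estimate, its sd) for P(chi-bar^2_01 > c)."""
    rng = random.Random(seed)
    exceed = 0
    level = [0] * (n + 1)             # level[k] counts simulations with M = k
    for _ in range(n_sim):
        g = [rng.gauss(0.0, 1.0) for _ in range(n)]
        fit, m = pava(g)
        gbar = sum(g) / n
        stat = sum((f - gbar) ** 2 for f in fit)
        exceed += stat > c
        level[m] += 1
    p_rf = exceed / n_sim
    var_rf = p_rf * (1.0 - p_rf) / n_sim
    # tails[k] = P(chi2_{k-1} > c); index 0 is unused.
    tails = [0.0] + [chi2_sf(c, k - 1) for k in range(1, n + 1)]
    pi_hat = [cnt / n_sim for cnt in level]
    p_lp = sum(pi_hat[k] * tails[k] for k in range(2, n + 1))       # formula (3)
    var_lp = (sum(pi_hat[k] * tails[k] ** 2 for k in range(2, n + 1))
              - p_lp ** 2) / n_sim                                   # formula (5)
    return p_rf, math.sqrt(var_rf), p_lp, math.sqrt(max(0.0, var_lp))

print(p_values_01(n=5, c=4.0, n_sim=5000))
```

With n = 2 and equal weights the level probabilities are both 1/2, so the true p-value is half the $\chi^2_1$ tail probability, which gives a convenient check on both estimates.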

4  Programs 

Programs we developed for $\bar\chi^2$ distributions
are based on algorithms described in Eddy
and Qian (1994) and Qian (1992), which are
shown to be better than other algorithms for
isotonic regressions. The xbarg program for
p-values of $\bar\chi^2$ distributions on an arbitrary
partially ordered set, which is written in C,
applies an acceptance sampling method
to generate normal random variables, uses the
IBCR algorithm in Qian (1994) to find isotonic
regressions, and utilizes a recursive formula
to find p-values of $\chi^2$ distributions. This
program requires a partial order as input.
To simplify the input and make the program
more efficient, we implemented four other programs,
xbar1, xbart, xbar2 and xbar3, for simulating
p-values of $\bar\chi^2$ distributions on a completely
ordered set, a tree ordered set, a rectangular grid
with componentwise increasing order, and a cubic grid
with componentwise increasing order respectively.
These five programs handle both equal and
unequal non-negative weights.

Utilizing results in Eddy and Qian (1994)
and Qian (1992), we find that xbar1, xbart
and xbar2 are strongly polynomial time programs.
Qian (1992) also gives the worst-case time
complexity of the xbar3 and xbarg programs. Table
2 lists the time (in seconds) to run these programs
on an HP 9000 model 735 workstation
with 100,000 simulations.

Table 2: Time (in seconds) of the programs

program   size        time
xbar1     1000        930.87
xbart     15          16.34
xbar2     20 x 20     6003.75
xbar3     3 x 3 x 3   210.99
xbarg     30          262.07

We use the xbar2 program to find p-values
of the $\bar\chi^2$ statistics in the Cornfield example with
100,000 simulations on a 486 DX/50 PC. It
took 1159 seconds to obtain p-values of these
$\bar\chi^2$ statistics. Results are listed in Table 3.

Table 3: P-values for $\bar\chi^2$ statistics in Cornfield Example

        T01 = 70.66        T12 = 50.50
        p       stdev      p       stdev
RF      0.000   0.0000     0.394   0.0015
LP      0.000   0.0000     0.393   0.0003

The p-value of $T_{01}$ is 0.0, so we reject the
independence model $M_0$. The p-value of
$T_{12}$ is 0.393, so there is no significant
difference between the isotonic model $M_1$ and the
saturated model $M_2$. Thus, we prefer the isotonic
model $M_1 - M_0$; in other words, we have
statistical evidence that the rate of heart disease
increases when blood pressure or serum
cholesterol increases. Table 1 gives estimates
of the rate of heart disease in different
levels of blood pressure and serum
cholesterol.

5  Accuracy 

The accuracy of the results is the most important
issue for simulations. It depends on
good quasi-random number generators. We
use several ways to test our normal random
number generators.

We use tables in the appendix of Robertson
et al. (1988) to check our programs. In Tables
1-5 in their appendix, excluding boundary
points, there are 99 cases for $\bar\chi^2_{01}$ distributions
and 99 cases for $\bar\chi^2_{12}$ distributions; each
case has 6 critical values, which have p-values
0.1, 0.05, 0.025, 0.01, 0.005 and 0.001 respectively.
We calculated all the p-values for these
critical values by both the RF method and the LP
method. Among the 1188 RF estimates $\hat p$ we
calculated, there are 44 for which the true
p-value is not within two standard deviations
of its estimate $\hat p$. The failure rate is
about 3.7%. For the LP estimate $\tilde p$ of p, we say a
case is a failure if one of the true values is not
within two standard deviations of its estimate
$\tilde p$. We find there are 5 failures among the 99 cases
for $\bar\chi^2_{01}$ distributions, and 4 failures among the 99
cases for $\bar\chi^2_{12}$ distributions. The failure rate is
approximately 4.5%. The results agree
with theoretical analysis. We computed level
probabilities for complete orders, simple tree
orders and unimodal orders, and compared the
results with Tables 10, 11 and 20 in the appendix
of Robertson et al. (1988). We also
computed level probabilities for rectangular
grids and compared them with the results of
Moonesinghe and Wright (1994). They are all very close.

Then we apply the theoretical results in Section
3 to do more checks. Let $p = P(\bar\chi^2_{01} > c)$.
First, we can use different seeds to get a random
sample of $\tilde p$, so we can obtain a sample
estimate and a sample standard deviation (SSD).
Clearly the SSD should be almost the same as the
standard deviation given by the formulas listed in
Section 3. Second, if the numbers of simulations
are different, then the ratio of the
standard deviation estimates should be the
square root of the inverse ratio of the numbers of
simulations. Third, the LP estimate $\tilde p$ is better
than the RF estimate $\hat p$. What we actually
did is the following. In the Cornfield data example,
we chose six values for $\bar\chi^2_{01}$ statistics
and six values for $\bar\chi^2_{12}$ statistics, and obtained
estimates of p by 1000 simulations, 10,000 simulations
and 100,000 simulations respectively.
We repeated this procedure 100 times. Table
4 lists statistics for $p = P(\bar\chi^2_{01} > 12.571289)$.
Due to the length of this paper, the other
eleven tables are omitted. Our simulations
show that the SSDs are almost the same as
the standard deviations calculated by (5); the ratios
of standard deviations are near $\sqrt{0.1} = 0.316228$;
and the LP estimate $\tilde p$ is statistically better than
the RF estimate $\hat p$. We applied this method
for complete orders, tree orders and special
partial orders with equal weights or unequal
weights and reached the same conclusions. So we are
confident that our normal random number generator
works well. From our extensive simulations, we
have the following recommendations for simulating
$\bar\chi^2$ distributions.

1. Use the LP estimate $\tilde p$ instead of $\hat p$.

2. Apply formulas (5) and (6) to get an
estimate of the standard deviation of $\tilde p$.

3. In hypothesis testing, do 1000 simulations
first, then run 100,000 simulations if
necessary.


Table 4: Some Statistics for $p = P(\bar\chi^2_{01} > 12.571289)$ in Cornfield Example

simulations       1000        10000        100000

LP   p            0.102418    0.101287     0.099986
     stdev        0.003269    0.001065     0.000333
     ratio                    [illegible]  0.312676
     SSD          0.00347     0.00108      0.00031

RF   p            0.093       0.0989       0.10152
     stdev        0.009184    0.002985     0.000955
     ratio                    0.325022     0.319933
     SSD          0.00936     0.00308      0.00096

Abstract

Recently,  interest  concerning  the  utilization  of 
statistical  process  control  (SPC)  and  engineering  process 
control  (EPC)  has  increased.  This  research  is  concerned 
with  the  utilization  of  SPC  techniques  in  process  control. 
For  simplicity,  this  research  considers  the  case  where  no 
feedback  control  action  is  in  a  process:  an  open  loop 
control.  In  order  to  explain  the  utilization  of  the  cuscore 
control  charts,  two  cases  —  one  case’s  underlying  process 
follows  AR(1)  and  the  other  follows  ARMA(1,1)  process 
with  mean  shifts  in  the  process  —  will  be  discussed.  In 
addition,  some  simulation  studies  will  be  presented. 

Introduction 

Recently,  interest  concerning  the  utilization  of 
statistical  process  control  (SPC)  and  engineering  process 
control  (EPC)  has  increased.  This  research  is  concerned 
with  the  utilization  of  SPC  techniques  in  process  control. 

In this research, the main objective of using the
SPC techniques is to detect the mean shift or transient
disturbance in real time. Since the traditional Shewhart
control charts are not sensitive in detecting small shifts
in the process [2] [4], the Shewhart control charts were
not used in this research. Although the cumulative-sum
(or cusum) control chart is effective in detecting small
shifts in the process, the cusum chart would be very slow
to detect large process shifts [2] [4]. Therefore, this
research will not consider cusum charts to detect the
unplanned transient disturbances. Instead, since the
cumulative score (or cuscore) chart [1] is not only
effective in detecting small shifts in the process but
also provides the cuscore statistic, which is helpful in
identifying the unplanned transient disturbances, it is
adopted in this study.

Cuscore  Charts  For  Open  Loop  Control 

Shao  et  al.  [3]  have  addressed  the  control  of 
transient  disturbance.  For  simplicity,  this  research 
considers  the  case  where  no  feedback  control  action  is  in 
a  process:  an  open  loop  control.  In  order  to  explain  the 
utilization  of  the  cuscore  control  charts,  two  cases  —  one 
case’s  underlying  process  follows  AR(1)  and  the  other 
follows  ARMA(1,1)  process  --  will  be  discussed. 

The details of the cuscore chart can be
found in [1]. The concept of the cuscore charts is
described as follows. Consider a model which can be
written as

$$a_t = f(y_t, x_t, m), \qquad (1)$$

where $y_t$ is the output observation, $x_t$ is the independent
variable, m is a certain unknown parameter, and f() is
a certain function. If m is the true value of the
unknown parameter, the resulting $a_t$'s would follow a
white noise sequence. Apart from a constant, the log
likelihood for $m = m_0$ is

$$l = -\frac{1}{2\sigma^2} \sum_{t=1}^{n} a_{t0}^2,$$

where the $a_{t0}$'s are obtained by setting $m = m_0$ in Equation
(1). Let

$$r_{t0} = -\left.\frac{\partial a_t}{\partial m}\right|_{m = m_0};$$

then the following relationship holds:

$$\left.\frac{\partial l}{\partial m}\right|_{m = m_0} = \frac{1}{\sigma^2} \sum_{t=1}^{n} a_{t0}\, r_{t0}.$$

The cuscore statistic with the parameter value $m = m_0$ is
defined as:

$$Q = \sum_{t=1}^{n} a_{t0}\, r_{t0}. \qquad (3)$$

Box and Ramirez [1] noted that the following
relationship holds exactly if the model is linear in the
parameter m, and approximately otherwise:

$$a_{t0} = (m - m_0)\, r_{t0} + a_t. \qquad (4)$$

Furthermore, Box and Ramirez [1] remarked that the
so-called centred cuscore (CC) is the cuscore evaluated at
$m = \bar m = (m_0 + m_1)/2$, and the CC is defined as:

$$CC = \sum_{t=1}^{n} \bar a_t\, \bar r_t,$$

where $\bar a_t$ and $\bar r_t$ are $a_t$ and $r_t$ evaluated at $m = \bar m$.
This CC is used to signal the out of control situation, and
the details of CC can be found in [1].

AR(1) Process With Mean Shifts
And Cuscore Charts

Consider the case in which the underlying
process can be modelled as an AR(1) process; that is, the
process follows

$$y_t = \phi y_{t-1} + a_t,$$

where $y_t$ is the output deviation from target at time t, $\phi$
is a certain parameter, and the $a_t$'s stand for white noise.

Now consider the utilization of the cuscore control
chart to detect the mean shift. Suppose that one wants to
detect a mean shift which has the magnitude of one
standard deviation (for simplicity, this research
considers that one standard deviation equals 1); then the
underlying process can be reformed as:

$$y_t = m\delta + \phi y_{t-1} + a_t, \qquad (5)$$

where m is a certain parameter, and $\delta$ stands for the
magnitude of the mean shift. If the process is operated
correctly, the parameter m should be zero and there
should be no mean shift in the process. On the other
hand, if some disturbances exist in the process, the mean
shift would be present in the process. Therefore, in this
case, the utilization of the cuscore chart is equivalent to
doing a hypothesis test for the parameter m; that is,
testing $m = m_1 = (1 - \phi)$ versus $m = m_0 = 0$.

Since this study wants to detect the mean shift
with one standard deviation, $\delta$ is set to be 1 and $m_1$
is set to be $(1 - \phi)$. The reason why this study sets $m_1$
equal to $(1 - \phi)$ is the following fact:
$\mu = m\delta/(1 - \phi)$, with the substitution of $\mu$ and $\delta$ with 1.

By using Equations (2), (3), and (4), one is able
to obtain the cuscore statistic:

$$Q = \sum_{t=1}^{n} (y_t - \phi y_{t-1})\, \delta. \qquad (6)$$

Since the one step ahead prediction for the AR(1) process is
$\hat y_t = \phi y_{t-1}$, Equation (6) can be reformed as:

$$Q = \sum_{t=1}^{n} (y_t - \hat y_t) \quad (\text{since } \delta = 1). \qquad (7)$$

Therefore, by viewing Equation (7), one knows that the
utilization of cuscore statistics is equivalent to using
cusum statistics on the residuals.

ARMA(1,1) Process With Mean Shifts
And Cuscore Charts

Consider the case in which the underlying
process can be modelled as an ARMA(1,1) process; that
is, the process follows

$$y_t = \phi y_{t-1} + a_t - \theta a_{t-1}, \qquad (8)$$

where $y_t$ is the output deviation from the target at time t,
$\phi$ and $\theta$ are certain parameters, and the $a_t$'s stand for
white noise.

Now consider the utilization of the cuscore control
chart to detect the mean shift. Suppose that one wants to
detect a mean shift which has the magnitude of one
standard deviation (again, this study considers that
one standard deviation equals 1); then the underlying
process can be reformed as

$$y_t = m\delta + \phi y_{t-1} + a_t - \theta a_{t-1}, \qquad (9)$$

where m stands for some unknown parameter and $\delta$
stands for the magnitude of the mean shift. If the
process is operated correctly, the parameter m should be
zero and there should be no mean shift in the process.
On the other hand, if some disturbances exist in the
process, the mean shift would be present in the process.
Therefore, the utilization of the cuscore chart is
equivalent to doing a hypothesis test for the parameter m;
that is, testing $m = m_1 = (1 - \phi)$ versus $m = m_0 = 0$.

Again, since one wants to detect the mean shift with one
standard deviation, $\delta$ is set to be 1 and $m_1$ is set to be
$(1 - \phi)$. The reason why $m_1$ equals $(1 - \phi)$ is the same as
in the previous case.

By  using  Equation  (2),  (3),  and  (4),  one  is  able 
to  obtain  the  cuscore  statistics: 


E  ^yt-^yt-i)  ^  • 

t-1 


(10) 


Since in this study one wants to detect the mean
shift with one standard deviation, $\delta$ is set to be 1, $m_0 = 0$,
and $m_1 = (1 - \phi)$. Thus,


—  yt-(^^)^-^yt-x 

(1-0B)  ' 

and  ^  ,  . 

(1-0B) 

The  centred  cuscore  then  would  be: 

CCp  -  20CCt.i-02CCt.2+ 
t-i  ^ 


Simulation  Studies 

To  show  the  utilization  of  the  cuscore  statistics, 
this  research  examines  a  simulation  study.  Assume  that 
the  underlying  process  can  be  modelled  as  an 
ARMA(1,1)  process.  This  study  uses  Equation  (8)  to 
represent  a  certain  process,  and  a  white  noise  sequence 
which  has  mean  of  zero  and  a  standard  deviation  of  1  is 
generated  for  this  certain  process.  Since  this  study  wants 
to  detect  the  mean  shift  with  magnitude  of  one  standard 
deviation,  this  study  shifts  the  process  mean  from  0  to  1 
after observation 25. In addition, this study arbitrarily
chooses $\phi = 0.7$ and $\theta = 0.6$.
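The simulation set-up described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' code: the function name `simulate_arma11`, the zero start-up values, and the seed are hypothetical, and the shift is introduced through the mean-shift form of the model with m = 0 up to observation 25 and m = (1 - phi) afterwards.

```python
import random


def simulate_arma11(n, phi, theta, shift_after=None, delta=1.0, seed=0):
    """Simulate y_t = m*delta + phi*y_{t-1} + a_t - theta*a_{t-1}.

    m is 0 up to and including observation `shift_after` and (1 - phi)
    afterwards, so the process mean moves from 0 towards delta.
    """
    rng = random.Random(seed)
    y_prev, a_prev = 0.0, 0.0  # assumed start-up values
    series = []
    for t in range(1, n + 1):
        a_t = rng.gauss(0.0, 1.0)  # white noise, mean 0, standard deviation 1
        shifted = shift_after is not None and t > shift_after
        m = (1.0 - phi) if shifted else 0.0
        y_t = m * delta + phi * y_prev + a_t - theta * a_prev
        series.append(y_t)
        y_prev, a_prev = y_t, a_t
    return series


# Same noise sequence with and without the shift (same seed).
no_shift = simulate_arma11(50, phi=0.7, theta=0.6)
with_shift = simulate_arma11(50, phi=0.7, theta=0.6, shift_after=25)
```

Because both series share the same noise, their difference isolates the effect of the shift: it is zero through observation 25 and then rises towards one standard deviation.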

Figures  1,  2,  3,  and  4  display  the  process 
outputs  with  no  mean  shift,  residuals  of  process  output 
with no mean shift, process outputs with mean shift 1
starting  at  observation  26,  and  residuals  of  process  output 
with  mean  shift  1  starting  at  observation  26,  respectively. 
Notice  that  the  prediction  of  the  ARMA(1,1)  process  is 

$$\hat{y}_t = \phi y_{t-1} - \theta\,(y_{t-1} - \hat{y}_{t-1}) .$$

Therefore,  the  residual  of  the  ARMA(1,1)  is  calculated 
as  follows: 


$$\text{Residual}_t = y_t - \hat{y}_t .$$
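The one-step prediction and residual can be computed recursively. The sketch below assumes the prediction has the form y-hat_t = phi*y_{t-1} - theta*(y_{t-1} - y-hat_{t-1}) and that both the observation and the prediction before the first time point are zero; the function name `arma11_residuals` is hypothetical.

```python
def arma11_residuals(y, phi, theta):
    """Return y_t - yhat_t for an ARMA(1,1) one-step-ahead prediction."""
    y_prev, yhat_prev = 0.0, 0.0  # assumed start-up values
    residuals = []
    for y_t in y:
        yhat_t = phi * y_prev - theta * (y_prev - yhat_prev)
        residuals.append(y_t - yhat_t)
        y_prev, yhat_prev = y_t, yhat_t
    return residuals


# Build a short series from a known noise sequence (no mean shift) ...
noise = [0.5, -1.0, 0.25, 0.0, 2.0]
y, y_prev, a_prev = [], 0.0, 0.0
for a_t in noise:
    y_t = 0.7 * y_prev + a_t - 0.6 * a_prev
    y.append(y_t)
    y_prev, a_prev = y_t, a_t

# ... and recover that noise as the residuals.
recovered = arma11_residuals(y, phi=0.7, theta=0.6)
```

With no mean shift and the correct parameters, the residuals reproduce the noise sequence, which is why the residual plot should look like random noise.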


Figure 1, in fact, represents an ARMA(1,1)
process with parameters $\phi = 0.7$ and $\theta = 0.6$ and
a white noise sequence with mean zero and standard
deviation 1. Since there is no mean shift in the process,
the residuals should behave like random noise. This
characteristic can be seen in Figure 2. Figure 3
represents an ARMA(1,1) process with a mean shift of 1
starting at observation 26. Since there is a mean shift
in the process, the residuals should no longer behave
like random noise. This characteristic can be observed
in Figure 4.

Figure  5  shows  the  plot  of  cuscore  statistics, 
centred  cuscore  statistics,  and  summation  of  the  residuals 
of  process  output  when  the  ARMA(1,1)  process  has 
mean  shift  of  1  starting  at  observation  26.  Since  in  this 
study  the  process  has  no  mean  shift  occurring  before 
observation  26,  one  can  expect  that  the  values  of  cuscore 
statistics  and  the  summation  of  the  residuals  are  around 
zero  before  observation  26.  Also,  since  the  process  has 
a  mean  shift  of  1  starting  at  observation  26,  one  can 
expect  that  the  values  of  cuscore  statistics  and  the 
summation of the residuals increase over time after
observation 26 (since the mean shift is positive). In
addition, since the centred cuscore is used to signal
the out-of-control situation, one can expect the slope
of the centred cuscore to change at some time around
(or after) observation 26.
These  characteristics  are  all  shown  in  Figure  5. 

Furthermore, one can observe that the minimum
point of the centred cuscore line is at observation 25.
The minimum point marks where the slope of the
centred cuscore line changes. One should take the
difference between the value of the centred cuscore at
time t (i.e., a time after the minimum point occurs) and
this minimum point to determine whether an
out-of-control signal is given. For example, in this
simulation study, the minimum point, which is equal to
-20.587, occurs at observation 25. Therefore, one can
expect that the out-of-control signal is given at
observation 34, since the difference between the
cuscore statistics (at observation 16) and the minimum
point is 15.62, and this value is greater than the
boundary h, 15.35. The boundary h is defined as [1]:

$$h = \frac{\sigma^{2}\,\ln(1/\alpha)}{m_1 - m_0}$$

Therefore, h = 15.35 if one chooses $\alpha = 0.01$.
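The boundary value can be checked numerically. The sketch below assumes the boundary has the form h = sigma^2 * ln(1/alpha) / (m1 - m0), which reproduces the reported 15.35 for sigma = 1, alpha = 0.01, m0 = 0 and m1 = 1 - 0.7; the function name `cuscore_boundary` is hypothetical.

```python
import math


def cuscore_boundary(sigma, alpha, m1, m0=0.0):
    """Decision boundary h = sigma^2 * ln(1/alpha) / (m1 - m0)."""
    return sigma ** 2 * math.log(1.0 / alpha) / (m1 - m0)


phi = 0.7
h = cuscore_boundary(sigma=1.0, alpha=0.01, m1=1.0 - phi)
print(round(h, 2))  # 15.35

# The reported difference of 15.62 exceeds the boundary, so a signal is given.
signal = 15.62 > h
```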

Summary 

This study is concerned with the use of the
cuscore control chart in an open-loop process. The
concept of using the cuscore control chart is discussed for
two  open  loop  processes,  AR(1)  and  ARMA(1,1).  In 
addition,  the  simulation  study  demonstrates  how  to  use 
the  cuscore  control  chart  to  detect  the  mean  shift  in  the 
process. 

The  objective  of  this  paper  is  to  show  that  the 
cuscore  control  chart  is  useful  in  detecting  the  mean  shift 


or  transient  disturbance.  Ongoing  research  is  developing 
a  technique  for  detecting  the  mean  shift  or  transient 
disturbance  in  real  time  for  a  closed  loop  process. 


Figure  1: 

Plot  of  output  deviations  from  target  which  are 
generated  by  an  ARMA(1,1)  process  with  no  mean 
shift 


Figure  2: 

Plot  of  residuals  which  are  generated  by  an 
ARMA(1,1)  process  with  no  mean  shift 


Figure  3: 

Plot  of  output  deviations  from  target  which  are 
generated  by  an  ARMA(1,1)  process  with  mean 
shift  1  starting  at  observation  26 

Figure 4:

Plot  of  residuals  which  are  generated  by  an 
ARMA(1,1)  process  with  mean  shift  1  starting  at 
observation  26 



Figure  5: 

Plot of cuscore, centred cuscore, and summation of the
residuals of process output when the ARMA(1,1)
process has mean shift 1 starting at observation 26


Abstract

A tree-based method for censored survival data with
time-dependent covariates is proposed. A likelihood
estimation procedure is used in the recursive partitioning
algorithm to grow trees. Time-dependent covariates are
incorporated in the partitioning procedure under a
piecewise proportional hazards structure. If time-dependent
covariates are present, the estimated hazard at a node
gives the relative risk for a group of individuals during
a specific time period. Both cross-validation and
bootstrap resampling techniques are implemented in the tree
selection procedure. The performance of the model is
investigated through simulation and application to real
data.

1  Introduction 

Tree-based methods were originally used in regression
and classification (Breiman, Friedman, Olshen, and
Stone, 1984); later the principle was adapted to
censored survival data. The need for tree-based methods
for survival data comes from clinical investigators, who
are usually interested in grouping patients with differing
interpretable prognoses.

Including time-dependent covariates in the survival
analysis leads to dynamic prognosis, where the
estimated risk of the patient's survival may change from
one time point to the next as the values of the covariates
change. The investigation of time-dependent covariates
in survival analysis has received considerable attention
recently in both the statistical and biomedical literature
(Cox and Oakes, 1984; Andersen, 1991).

LeBlanc and Crowley (1992) extended proportional
hazards regression to tree-structured relative risk
estimates for censored survival data with a one-step full
likelihood estimation procedure. This method works
well with time-independent covariates. Other tree-based
methods have also been proposed for analyzing survival
data. Gordon and Olshen (1985) presented a method
using distance measures between Kaplan-Meier curves
and their nearest continuous approximation. Davis and
Anderson (1989) proposed a method based on the
exponential log-likelihood structure. Segal (1988) presented
a totally nonparametric application using the
Tarone-Ware or Harrington-Fleming classes of two-sample rank
statistics. LeBlanc and Crowley (1993) developed a
recursive partitioning procedure based on maximizing the
dissimilarity in the survival distributions of patients
between regions of the covariate space. Existing survival
tree methods are only suitable for censored data with
time-independent covariates. Little has been done for
time-dependent covariates.

In this paper, based on LeBlanc and Crowley's work
(1992), we propose a model which accommodates
time-dependent covariates in piecewise proportional hazards
survival trees for censored survival data (Huang,
unpublished Ph.D. dissertation, Department of
Biostatistics, University of Alabama at Birmingham, 1994). This
method splits nodes through the product space of the
covariate and time, and establishes measures of
improvement based on piecewise proportional hazards. The
estimated hazard function or the estimated proportionality
at each branch node summarizes the risk of a group of
individuals during each specific time period. The next
section briefly describes the basic ideas of the piecewise
proportional hazards survival trees. Section 3
investigates the proposed method based on simulation studies.
Section 4 exemplifies the method by an analysis of the
UAB Localized Melanoma Data. Section 5 gives some
discussion.

2  A  New  Survival  Trees  Method 

Generally, tree-based methods recursively partition the
covariate space into disjoint regions and the
corresponding data into groups. For each split node some measure
of separation in the response distribution between the
two daughter nodes is calculated. Although many types
of partitions could be considered, we will consider only
splits on a single variable at a time, which is easily
generalized to combinations of covariates. All possible splits
for each of the covariates are evaluated, and the variable
to be split and the split point are chosen to best separate
the nodes. The same procedure is applied recursively to
increase the number of nodes until each contains only
a few observations. The resulting model can be
represented as a binary tree. After a large tree is grown, there
are rules for recombining nodes and for readjusting the
size of the tree.

LeBlanc and Crowley's (1992) relative risk trees
adopt the proportional hazards model, which specifies
the following hazard function at time t for an
individual with covariate vector z:

$$\lambda(t \mid z) = \lambda_0(t)\, s(z), \qquad (1)$$

where $s(z) > 0$ and $\lambda_0(t)$ is the unknown baseline hazard.
The first step of a full likelihood estimation procedure
is used in a recursive partitioning algorithm to grow the
tree. If the covariate vector z also changes with time,
we can generalize (1) to the more general form

$$\lambda(t \mid z(\cdot)) = \lambda_0(t)\, s(z(t)). \qquad (2)$$

The trees method we propose splits nodes through
the product space of the covariate and time based on
a rule that minimizes a loss function defined by
the log likelihood under piecewise proportional hazards
assumptions. If only time-independent covariates
are considered in association with failure time, this
method reduces to LeBlanc and Crowley's (1992)
relative risk trees. However, we can always add an auxiliary
time-dependent covariate to monitor the change of
hazards with time. If time-dependent covariates are also
involved, our new algorithm may give different piecewise
proportional hazards survival estimates for different
individuals. Even the same individual may be partitioned
to different nodes in different time periods.

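If we consider time-dependent covariates that are
associated with the event time, the proposed piecewise
proportional hazards trees method approximates the
proportional hazards model (2) with the following hazard
function

$$\lambda_i(t) = \begin{cases} \lambda_0(t)\,\theta_{i1}, & t_{i,0} < t \le t_{i,1} \\ \lambda_0(t)\,\theta_{i2}, & t_{i,1} < t \le t_{i,2} \\ \quad\vdots & \quad\vdots \\ \lambda_0(t)\,\theta_{ik}, & t_{i,k-1} < t \le t_{i,k} \end{cases} \qquad (3)$$

where $\lambda_0(t)$ is the baseline hazard function and
$\theta_{i1}, \ldots, \theta_{ik}$ are positive. For mathematical convenience, some
of the $t_{i,j}$ may be defined as $\infty$ so that $\lambda_i(t)$ may have fewer
than k pieces.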

For simplicity, we illustrate the case with only one
time-dependent covariate, which is assumed to be
monotonically increasing in time for each individual.
Suppose we choose a split point $S_1$ for a time-dependent
covariate $Z(t)$. There are three possible relationships
between $Z_i(t)$ and $S_1$ for each individual:
(1) $Z_i(t) < S_1$, $0 < t \le x_i$;
(2) $Z_i(t) < S_1$, $0 < t \le t_{i,1}$ and $Z_i(t) \ge S_1$, $t_{i,1} < t \le x_i$;
(3) $Z_i(t) \ge S_1$, $0 < t \le x_i$.
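The three relationships can be expressed as a small classification rule. The sketch below is illustrative only: it assumes each individual is summarized by the time at which the monotone covariate first reaches the split point (infinity if never, zero if at or above it from the start) and by the observed time x; the function name `split_case` is hypothetical.

```python
import math


def split_case(crossing_time, x):
    """Return 1, 2 or 3 for the three covariate/split-point relationships.

    crossing_time: first time the covariate reaches the split point
                   (math.inf if it never does, 0.0 if it starts there).
    x:             observed (failure or censoring) time, assumed > 0.
    """
    if crossing_time <= 0.0:
        return 3  # at or above the split point for the whole follow-up
    if crossing_time >= x:
        return 1  # below the split point for the whole follow-up
    return 2      # below until crossing_time, at or above afterwards


cases = [split_case(math.inf, 5.0), split_case(2.0, 5.0), split_case(0.0, 5.0)]
```

An individual in case 2 contributes follow-up time to both daughter nodes, which is what distinguishes this splitting from the time-independent setting.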

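We start to grow our survival tree from the root node
F1, which consists of the whole sample, based on a
constant risk for all individuals. Under the piecewise
proportional hazards assumption, it follows that the
contributions of the left and right daughters of root node F1
to the likelihood are

$$L(\theta_l) = \prod_{i \in I_1} \bigl(\lambda_0(x_i)\theta_l\bigr)^{\delta_i} \exp\bigl(-\Lambda_0(x_i)\theta_l\bigr) \prod_{i \in I_2} \exp\bigl(-\Lambda_0(t_{i,1})\theta_l\bigr),$$

where $I_1 = \{i \mid Z_i(t) < S_1,\ 0 < t \le x_i\}$ and
$I_2 = \{i \mid Z_i(t) < S_1,\ 0 < t \le t_{i,1};\ Z_i(x_i) \ge S_1\}$, and

$$L(\theta_r) = \prod_{i \in I_2} \bigl(\lambda_0(x_i)\theta_r\bigr)^{\delta_i} \exp\bigl(-(\Lambda_0(x_i) - \Lambda_0(t_{i,1}))\theta_r\bigr) \prod_{i \in I_3} \bigl(\lambda_0(x_i)\theta_r\bigr)^{\delta_i} \exp\bigl(-\Lambda_0(x_i)\theta_r\bigr),$$

where $I_3 = \{i \mid S_1 \le Z_i(t),\ 0 < t \le x_i\}$, $\delta_i$ equals 1 for an
observed failure and 0 for a censored time, and $\Lambda_0(t)$ is the
baseline cumulative hazard function. Hence the
likelihood function is

$$L(\theta_l, \theta_r) = L(\theta_l)\, L(\theta_r).$$

If we know the baseline cumulative hazard, the
maximum likelihood estimates of $\theta_l$ and $\theta_r$ are

$$\hat\theta_l = \frac{\sum_{i \in I_1} \delta_i}{\sum_{i \in I_1} \Lambda_0(x_i) + \sum_{i \in I_2} \Lambda_0(t_{i,1})}$$

and

$$\hat\theta_r = \frac{\sum_{i \in I_2} \delta_i + \sum_{i \in I_3} \delta_i}{\sum_{i \in I_2} \bigl(\Lambda_0(x_i) - \Lambda_0(t_{i,1})\bigr) + \sum_{i \in I_3} \Lambda_0(x_i)} .$$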
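However, since we do not know the cumulative
hazard, a natural estimator of the cumulative hazard given
estimates $\hat\theta_l$ and $\hat\theta_r$ (with $\hat\theta_j(x_i)$ denoting whichever of
the two applies to individual j at time $x_i$),

$$\hat\Lambda_0(t) = \sum_{i:\, x_i \le t} \frac{\delta_i}{\sum_{j:\, x_j \ge x_i} \hat\theta_j(x_i)} ,$$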

is used. Similar to LeBlanc and Crowley's model (1992),
only the first iteration will be used in the recursive
partitioning procedure to grow the tree. The Breslow
estimator evaluated at $\hat\theta_l = 1$ and $\hat\theta_r = 1$, which is the Nelson
(1969) cumulative hazard estimator, is used. $\hat\theta_l$ and $\hat\theta_r$
can be interpreted as the observed number of deaths
divided by the expected number of deaths when $Z_i(t) < S_1$
and when $Z_i(t) \ge S_1$, respectively.
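The observed-over-expected interpretation can be illustrated with a small sketch. It assumes the one-step estimates are the observed deaths in each daughter divided by the Nelson cumulative hazard accumulated while the individual is on that side of the split; the function names, the `crossing` encoding (infinity if the covariate never reaches the split point, zero if it starts there), and the toy data are all hypothetical.

```python
def nelson_aalen(times, events):
    """Return a function t -> Nelson cumulative hazard estimate at t."""
    event_times = sorted({x for x, d in zip(times, events) if d})
    increments = []
    for s in event_times:
        deaths = sum(1 for x, d in zip(times, events) if d and x == s)
        at_risk = sum(1 for x in times if x >= s)
        increments.append((s, deaths / at_risk))

    def cum_hazard(t):
        return sum(inc for s, inc in increments if s <= t)

    return cum_hazard


def daughter_estimates(times, events, crossing):
    """Observed/expected deaths for the left and right daughter nodes."""
    H = nelson_aalen(times, events)
    obs_l = obs_r = exp_l = exp_r = 0.0
    for x, d, c in zip(times, events, crossing):
        c = min(max(c, 0.0), x)   # time spent below the split is (0, c]
        exp_l += H(c)             # expected deaths while on the left
        exp_r += H(x) - H(c)      # expected deaths while on the right
        if c >= x:                # still below the split at time x
            obs_l += d
        else:
            obs_r += d
    return obs_l / exp_l, obs_r / exp_r, exp_l + exp_r


inf = float("inf")
times = [2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
events = [1, 0, 1, 1, 0, 1]
crossing = [inf, inf, 1.0, 0.0, 2.5, inf]
theta_l, theta_r, expected_total = daughter_estimates(times, events, crossing)
```

In this toy data set the left daughter has 2 observed deaths against 2.25 expected and the right daughter 2 against 1.75, and the expected deaths over both daughters add up to the 4 observed deaths.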

After a survival tree is grown based on the above
procedure, an estimated cumulative hazard function is also
obtained by iteration. Then, we take this estimated
cumulative hazard function as the given baseline
cumulative hazard to grow another tree from the root node, and
pruning and tree selection will be based on this second
tree.

If time-dependent covariates are included in splitting
and growing a survival tree, it may no longer be true that
every split will create two exclusive groups of individuals
according to the value of the covariate. When
time-dependent covariates exist, the ratios of estimated
hazards between nodes are used to summarize the relative
risk of a group of individuals during a specific time
period, and as the tree grows, the relative risk functions
may have many pieces.

As with CART, a nested sequence of subtrees is
defined by minimal cost-complexity pruning.
Cross-validation and bootstrap resampling (Efron, 1982) are
used to make "honest" estimates of the loss associated
with each tree in the sequence, and the final tree is
selected based on these estimates.

3  Simulation  Studies 

In order to investigate the performance of the piecewise
proportional hazards trees, simulation studies were
conducted. In this section, procedures and results of the
simulation experiments are reviewed. The comparison
between the trees model and Cox proportional
hazards regression (Cox, 1972) with time-dependent
covariates is studied. The proposed tree and the Cox model
are applied to three types of random samples, each of
which is examined from a different perspective.

Each of the three simulations was designed from a
different perspective. In the first simulation, the random
samples were generated from piecewise exponential
distributions with non-monotone underlying hazards
associated with a time-dependent covariate. In the second
simulation, the random samples were generated from
piecewise exponential distributions with monotone
underlying hazards associated with a time-independent covariate
and a time-dependent covariate. In the third simulation,
the random samples were generated from Weibull
distributions with mixed underlying hazards associated with
a time-independent covariate, and an auxiliary
time-dependent covariate was added for assessing nonconstant
hazard functions. The reason for doing this was to
examine the capability of the tree method in dealing with
different survival distributions. The Cox proportional
hazards regression model was also applied to each of the
random samples from the three survival distributions for the
purpose of comparison. In each simulation, one hundred
random samples were generated.
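A survival time from a piecewise exponential distribution can be drawn by inverting the cumulative hazard. The sketch below is a generic illustration, not the simulation design used here: the change points, hazard rates, and the function name `piecewise_exponential_time` are assumptions, and the hazard is taken to be piecewise constant in time only.

```python
import math


def piecewise_exponential_time(u, breaks, rates):
    """Map a uniform draw u in (0, 1) to a survival time.

    breaks: increasing change points of the hazard.
    rates:  len(breaks) + 1 positive hazard rates, one per interval.
    The time returned solves cumulative_hazard(t) = -log(u).
    """
    target = -math.log(u)
    start, accumulated = 0.0, 0.0
    for end, rate in zip(breaks, rates):
        piece = rate * (end - start)  # cumulative hazard over this interval
        if accumulated + piece >= target:
            return start + (target - accumulated) / rate
        accumulated += piece
        start = end
    # Past the last change point the final rate applies indefinitely.
    return start + (target - accumulated) / rates[-1]


# Hazard 1.0 on (0, 1], then 2.0 afterwards.
t_early = piecewise_exponential_time(math.exp(-0.5), [1.0], [1.0, 2.0])
t_late = piecewise_exponential_time(math.exp(-2.0), [1.0], [1.0, 2.0])
```

Feeding the function uniform random numbers (for example from `random.random()`) yields a sample; a non-monotone hazard is obtained simply by choosing rates that rise and then fall.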

The sample sizes are 400 for the first and the third
simulations and 450 observations for the second
simulation. The average censoring rate among the three simulation
studies is approximately 20%. The minimum terminal
node size permitted for splitting was 20 observations.
The right pruned subtree was selected in each method
by minimizing the ten-fold cross-validation and the
bootstrap estimates of the prediction error.

The piecewise proportional hazards trees performed
well in all three survival distributions and basically
recovered the changes of the hazard rates. The Cox
proportional hazards regression model totally failed in two
of our three simulations and seriously underestimated in
one simulation. In the tree selection, the cross-validation
procedure tended to underestimate the prediction error.

In the simulation, the proposed trees method with the
addition of auxiliary time-dependent covariates showed
some strength. Our example demonstrated that with
an auxiliary time-dependent covariate the proposed trees
method was capable of detecting underlying hazards
which were changing with time. As exploratory
methods, the proposed trees are superior to some
previous trees methods even when time-dependent covariates
are absent.

4  Example 

Survival for patients with cutaneous melanoma, a cancer
of the skin, is strongly associated with a number of
clinical and pathological factors. In the past two decades,
extensive studies have been done and remarkable progress
has been made in the identification of dominant factors
that affect the outcome of melanoma (Balch, Houghton,
Sober, Milton, and Soong, 1992). Using multivariate
regression analysis methods for survival data,
tumor thickness at diagnosis, tumor ulceration, invasion
level, and lesion location are found to be the key
prognostic factors for localized melanoma. Additionally, a
fine model has also been developed recently to predict
survival and recurrence in localized melanoma (Soong,
Shaw, Balch, McCarthy, Urist, and Lee, 1992).

In this section, we reanalyze the University of
Alabama at Birmingham (UAB) localized melanoma data
using the piecewise proportional hazards trees method.
In cancer clinical trials, discussion has largely been
restricted to the analysis of mortality data, where each
patient is classified as dead or alive (censored), or to the
analysis of disease-free survival data, where each patient
is classified as either disease-free or not. From a
different perspective, with recurrence as one of the potential
prognostic factors, in this analysis we would like to see
the possible dynamic impact of recurrence and other
factors on survival in localized melanoma. Recently, some
studies have been done by considering multistate models
instead of the simple two-state models for survival data
(Andersen, 1988). Multistate models provide a flexible
framework for the study of the effects of covariates on
several transition rates, and important biological insight
may be gained from the analysis of such a model.

The analysis presented here is based on 702 localized
melanoma patients from the Surgical Oncology Service
at the University of Alabama at Birmingham from 1955
to 1980. Patients have been referred primarily from
Alabama, with some coming from the surrounding states of
Florida, Mississippi, Tennessee, and Georgia.
Approximately 78.6% of the patients had censored survival times.
Four clinical and pathological factors previously known
to be associated with survival are included in the analysis
as time-independent covariates. They are tumor
thickness, lesion location, ulceration, and level of invasion.

Tumor thickness has values 1 to 6, which are coded
for tumor less than 0.76mm thick, between 0.76mm and
1.49mm thick, between 1.50mm and 2.49mm thick,
between 2.50mm and 3.99mm thick, between 4.00mm and
7.99mm thick, and more than 8.00mm thick, respectively.
Lesion location has values 0 and 1, corresponding to
extremity and axial. Ulceration has values 0 and 1, meaning
no and yes. Invasion with value 0 represents level II,
and 1 represents levels III, IV and V. In addition, we
treat recurrence as a time-dependent covariate based on
multiple measures recorded at each time of recurrence.

Figure 1: Survival Tree of Localized Melanoma Data

Clinically, the severity of melanoma is defined in three
stages. If recurrence occurs, it must be in one of the three
clinical stages. Each measure of the recurrence in the
analysis depends on the clinical stage of the melanoma
recurrence. Therefore, we code recurrence as a step
function of time that takes values 1, 2 and 3
corresponding to the clinical stages. The recurrence time
is included in this covariate. Three follow-up melanoma
recurrences are used to construct the time-dependent
covariate. In other words, each patient has a maximum of
three possible measures for melanoma recurrence.
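A step-function covariate of this kind can be sketched as follows. The representation is an assumption made for illustration: each patient is given as a list of (recurrence time, clinical stage) pairs, the value 0 is used before any recurrence (a code not stated in the text), and the function name `recurrence_covariate` is hypothetical.

```python
from bisect import bisect_right


def recurrence_covariate(recurrences):
    """Build Z(t) from up to three (time, stage) pairs sorted by time."""
    times = [t for t, _ in recurrences]
    stages = [s for _, s in recurrences]

    def z(t):
        k = bisect_right(times, t)          # recurrences at or before time t
        return stages[k - 1] if k else 0    # 0 = no recurrence yet (assumed)

    return z


# A patient with a stage 1 recurrence at month 14 and stage 3 at month 30.
z = recurrence_covariate([(14.0, 1), (30.0, 3)])
values = [z(10.0), z(14.0), z(29.9), z(30.0), z(100.0)]
```

The covariate jumps at each recorded recurrence time and keeps the most recent stage until the next recurrence, which is how the recurrence time enters the covariate.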

Trees were grown with a minimum node size of 25
patients. The reason for doing so is that we are not
interested in extremely small prognostic groups. The
piecewise proportional hazards trees method grew and selected
a survival tree with six terminal nodes as shown in
Figure 1, where the numbers of patients are inside the upper
level of the nodes and the estimated proportionalities
are inside the lower level of the nodes. The first split
of the tree was on tumor thickness, with less than 0.76mm
thick versus thicker. The next split was on recurrence,
with clinical stage 3 versus other. For patients who had
clinical stage 1 or 2 recurrence, the split was on
tumor thickness again, with between 0.76mm and 1.49mm
thick versus thicker, and again with 1.50mm and 2.49mm
thick versus thicker. Finally, for patients who had
clinical stage 3 recurrence, the split was on extremity or axial
lesions.

The results were consistent with the previous analyses
in that tumor thickness and lesion location were important
prognostic factors. Based on these brief analyses, there
is evidence that the time-dependent covariate
recurrence is also a key prognostic factor. We can see clearly
that patients whose recurrence changed
from local to distant sites would have a much higher risk of
death.

Due to the adjustment of a baseline, the final
piecewise proportional hazards tree shows that tumor
thickness is the most important predictor of the
clinical course. Patients who had a tumor less than 0.76mm
thick had the best prognosis whether or not they had
melanoma recurrence. In contrast, patients who had a
thick tumor had a worse prognosis. However, patients
who had a thick tumor could be divided
according to recurrence. Patients with stage 2 recurrence or
less were doing better than patients with stage 3
recurrence, although their prognosis still could be assessed
by tumor thickness, in which the thicker the tumor, the
worse the prognosis. Finally, patients who had stage 3
recurrence still could be partitioned by primary
lesion site. Patients with axial melanomas had the worst
prognosis. Again, the analysis reveals a certain
interaction among these significant factors.

5  Discussion 

Parallel to relative risk trees (LeBlanc and Crowley,
1992), based on a piecewise proportional hazards
structure, a new survival trees method has been proposed
to appropriately handle time-dependent covariates. As a
more flexible alternative to the previous work (LeBlanc
and Crowley, 1992), if no time-dependent covariates are
included in the data, an auxiliary time-dependent
covariate could be created to monitor the change of the
hazard. Even when time-dependent covariates are present,
including such a covariate might fit the model better.

Simulations were conducted on each of the three data
patterns with 100 repetitions each. The Cox
proportional hazards regression model was also applied to
the random samples for comparison. The proposed trees
method performed well on simulated data.

The UAB localized melanoma data set, with
melanoma recurrence as a time-dependent covariate and
other factors as time-independent covariates, was
analyzed by the proposed trees method. The melanoma
recurrence was found to be a dynamic prognostic factor
that affected survival. Patients who had a recurrence
that changed from local to distant sites would have little
chance of surviving.

We emphasize the importance and the necessity of
including time-dependent covariates in survival analysis.
With time-dependent covariates, the analysis leads to
dynamic prognosis. We also recognize the difficulty and
the complication of involving time-dependent covariates
in the analysis. It is noted that time-dependent
covariates, in principle, are easy to include in the Cox
model, but strict data requirements may have prevented
widespread use of the Cox regression model with
time-dependent covariates. However, the main problem is
that the interpretation of the results from a model with
time-dependent covariates is less obvious. Although we
have introduced the survival trees models with
time-dependent covariates, the ways we interpret the results
might not be the only ones or the best ones. A great deal
of work remains to be done.


Abstract

Prospective studies often involve rare events as study
outcomes, and it is of primary concern to identify risk
factors and risk groups associated with the outcomes.
Practical solutions to risk factor analyses in prospective
studies are discussed. We address strategies to determine
tree structures, to estimate relative risks, and to
manage missing data in connection with some important
epidemiological problems. Some of the basic ideas behind
our strategies follow from the work of Breiman, Friedman,
Olshen, and Stone (1984), although we propose
extensions to their methods in order to resolve some practical
problems that arise in implementing these methods in
epidemiologic studies.

1  Introduction 

Rare events or diseases, such as AIDS and birth defects,
are common targets in epidemiologic studies.
Accompanying the study outcome, data on a number of
putative risk factors and covariates are typically gathered.
The goal is to identify risk factors associated with the
outcome. Logistic and log-linear regressions, unified as
generalized linear models (GLM), are popular statistical
tools for analyzing these studies. One key element in
the GLM is the link function between the log-odds of
the events and a linear form of covariates and
parameters; see McCullagh and Nelder (1989) for an excellent
discussion of the subject. The GLM is very attractive in
applications for many reasons, such as the simplicity of
the linear models and the interpretability of the
parameters in the logistic models. In this paper, we present
an alternative nonparametric approach which is more
appropriate and flexible in many instances because this
approach does not rely on most of the restrictive
assumptions made in the GLM.

The tree-based method is useful for exploring data when
there are large numbers of variables and considerable
missing information, and when a linear combination of
covariates does not have an intuitive interpretation. In
particular, epidemiologic studies often have categorical
variables that do not have meaningful linear
combinations. The tree-based method identifies risk factors by
specifying groups at risk. In the following discussion,
a familiarity with the work of Breiman et al. (1984) is
assumed.

2  The  Tree-based  Method 

2.1  Determination  of  Tree  Structures 

Statistical inference is typically based on optimality
criteria such as maximum likelihood. In the context of the
tree-based methods, we use an impurity function. An
advantage of this is that the statistical outputs are "best"
in some specified sense. The disadvantage is that the
results may not be intuitive and convenient to interpret.
We now explain situations for which adjustments may
need to be made to tree structures. Therefore, a
repairing step may be required for the tree-based method.
The repairing may be done during the tree growing step
and/or after the tree is pruned.

In  a  tree  the  same  co variate  may  be  used  to  split  more 
than  one  nodes  while  the  cut-off  points  are  different  but 
close.  We  would  naturally  question  whether  the  cut-off 
points  are  indeed  different.  This  can  be  answered  sta¬ 
tistically  using  significance  tests  or  clinically  according 
to  whether  one  really  cares  about  that  difference.  If  the 
answer  is  negative,  one  may  want  to  force  the  cut-off  to 
be  the  same  so  that  the  interpretation  is  simpler.  Now 
suppose that a split on $x_i$ is suggested by the optimality
criterion in constructing a tree, but $x_j$ is very competitive
in terms of the impurity of the resulting left and
right nodes. When there are other good reasons (e.g.,
the reliability and nature of the measurement) to use
$x_j$, the user may prefer $x_j$ to $x_i$. Moreover, if we use,
say, cigarettes smoked as a covariate, suppose that the
computer  splits  the  population  according  to  whether  one 
smoked  at  least  19  cigarettes  per  day.  It  may  make  more 
sense  to  use  20  instead  of  19  because  20  corresponds  to 
a pack of cigarettes.

It has been observed (cf. Breiman et al., pp. 313-317)
that the splitting rule used in the tree growing step tends
to favor end-cut splits. To avoid the end-cut preference
problem, a solution provided by Breiman et al. is to
take  a  different  splitting  criterion.  A  much  less  technical 
solution  could  be  replacing  the  suspicious  split  with  a 
competitive  one  that  does  not  suffer  from  this  problem. 
This  is  possible  in  our  implementation  of  the  tree-based 
method  by  allowing  the  user  the  option  of  selecting  their 
own  splits. 

2.2  Tree  Pruning 

Breiman et al. (1984) described an automated pruning
procedure via cross validation. An implicit assumption
for this automated procedure is that the grown tree is
left intact. As discussed in the previous section, the tree
produced  by  an  automated  procedure  may  not  be  satis¬ 
factory  and  certain  repairs  may  be  needed.  This  would 
violate  the  premise  under  which  the  cross  validation  is 
used.  To  address  this  concern,  we  change  the  pruning 
step  for  the  present  study  as  follows.  Since  we  are  aim¬ 
ing  at  finding  high  risk  individuals  in  a  population,  we 
will  prune  off  a  node  if  the  risks  for  the  left  and  right 
daughters  are  not  significantly  different  at  a  nominal  sig¬ 
nificance  level  in  terms  of  the  relative  risk.  In  epidemiol¬ 
ogy,  a  significance  level  of  0.05  is  usually  an  acceptable 
choice.  In  our  pruning  step,  repeated  significance  tests 
are  actually  performed.  A  lower  significance  level  may 
be  more  appropriate  if  our  purpose  is  to  test  certain 
hypotheses.  However,  we  use  the  significance  test  as  a 
tool  to  select  splits  in  the  same  way  as  in  linear  regres¬ 
sion  one  uses  the  significance  test  to  select  variables  via 
stepwise  procedures.  After  the  determination  of  the  tree 
structure,  our  main  purpose  is  to  generate  hypotheses 
for future studies. For this reason, it is not critical to
use a "perfectly" rational choice.

Start  with  a  level  of  0.05  and  prune  off  a  pair  of  left 
and  right  nodes  from  the  bottom  of  the  big  tree  if  we  can¬ 
not  reject  the  hypothesis  that  the  relative  risk  (estimated 
from  the  resubstitution  method)  for  the  two  nodes  equals 
1  at  this  significance  level.  This  yields  a  primary  tree, 
which  usually  has  a  reasonable  size.  Next,  we  examine 
the  primary  tree  to  see  (a)  which  splits  are  superficial 
by  estimating  the  relative  risk  using  cross  validation  as 
described  below;  (b)  which  splits  may  be  scientifically 
uninterpretable  by  reviewing  the  literature;  (c)  which 
splits  may  need  more  data  to  justify.  After  this  exami¬ 
nation  step,  we  have  a  final  tree.  Furthermore,  one  may 
use  this  final  tree  to  explore  alternative  trees.  There¬ 
fore,  our  pruning  procedure  is  not  completely  automatic 
and  we  deliberately  leave  room  for  users  to  apply  their 
knowledge of the data. See Zhang and Bracken (1994)
for an application of this procedure.
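The pruning rule above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' program: the function names `relative_risk_test` and `prune_pair` are hypothetical, the 2 x 2 layout (a events and b non-events in the low risk daughter, c events and d non-events in the high risk daughter) is assumed, and a large-sample Wald test on the log relative risk stands in for whichever test the authors used.

```python
import math

def relative_risk_test(a, b, c, d):
    """Wald test of H0: relative risk = 1 from a 2 x 2 table.

    Assumed layout: low risk daughter has a events and b non-events,
    high risk daughter has c events and d non-events. All four cells
    are assumed positive.
    """
    risk_low = a / (a + b)
    risk_high = c / (c + d)
    rr = risk_high / risk_low
    # Large-sample standard error of log(RR).
    se = math.sqrt(1 / c - 1 / (c + d) + 1 / a - 1 / (a + b))
    z = math.log(rr) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return rr, p_value

def prune_pair(a, b, c, d, level=0.05):
    """True when the two daughter nodes should be pruned off."""
    _, p_value = relative_risk_test(a, b, c, d)
    return p_value >= level
```

Applied from the bottom of the large tree upward with `level=0.05`, a rule of this kind yields what the text calls the primary tree; the later examination steps (a) to (c) remain manual.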


2.3  Risk  Estimation 


In  the  preceding  section,  the  relative  risk  is  used  to  prune 
an  over-grown  tree.  The  impurity  function  used  to  se¬ 
lect  splits  is  closely  related  to  the  relative  risk.  Hence, 
a  split  of  low  impurity  tends  to  result  in  a  high  relative 
risk.  This  suggests  that  the  resubstitution  estimate  of 
the  relative  risk  may  be  biased  upward  because  impu¬ 
rity  was  the  selection  criterion.  Despite  the  bias,  the 
resubstitution  estimates  are  still  useful  for  pruning  the 
large  tree  although  they  are  not  reliable  for  interpreting 
the  final  tree.  Because  the  resubstitution  estimates  are 
upward  biased,  the  splits  of  a  tree  tend  to  be  more 
statistically  significant  than  they  really  are.  We  expect 
that  the  number  of  terminal  nodes  after  deletion  using 
the  resubstitution  estimates  is  larger  than  that  resulting 
from  more  realistic  estimates,  e.g.,  the  cross  validation 
method  as  described  shortly. 

To  correct  the  bias  in  the  resubstitution  estimates, 
we  describe  an  alternative  method  using  cross  validation 
locally.  It  is  based  on  the  idea  that  a  fair  estimate  of 
relative  risk  may  be  derived  from  another  data  set  that 
has  been  collected  under  similar  conditions. 

Breiman et al. (1984, pp. 150-155) proposed an ad
hoc  but  well-designed  cross  validation  procedure  to  cal¬ 
culate  within  node  misclassification  rates.  Unfortunate¬ 
ly,  their  procedure  is  not  directly  applicable  for  the  esti¬ 
mation  of  relative  risk  because:  (a)  the  procedure  focuses 
on  the  node  instead  of  the  splitting  variable;  (b)  the  rel¬ 
ative  risk  may  be  derived  from  the  misclassification  rates 
if  we  assign  the  unit  misclassification  cost  that  is  obvi¬ 
ously  inappropriate  for  the  present  application;  and,  (c) 
the  global  cross  validation  is  not  applicable  when  repairs 
must  be  made  to  the  grown  tree. 

The local cross validation method proceeds as follows.
First, we randomly divide the population of interest
into $V$ sub-populations. For instance, we may
take $V = 5$. Let $\mathcal{L}_i$ ($i = 1, 2, 3, 4, 5$) denote the 5
sub-populations. We first leave $\mathcal{L}_1$ aside and use
$\bigcup_{i=2}^{5} \mathcal{L}_i$ to select the split $s_1^*$ based on variable $x$.
It is conceptually important to note that the split $s_1^*$ is searched only
over the variable that has already been chosen. This restriction
is enforced in particular to address the effect
of a specified factor. Then, we can use $s_1^*$ to stratify
$\mathcal{L}_1$ and record the 4 entries $(a, b, c, d)$ in a 2 x 2 table
based on the factor level and the response (event) for $\mathcal{L}_1$.
The same is done by leaving out $\mathcal{L}_i$
($i = 2, 3, 4, 5$) in turn and using only the remaining
sub-populations to select a split $s_i^*$, again based on race. The
entries based on $\mathcal{L}_i$, which will be stratified by $s_i^*$, will be
recorded. Sum the cell entries over the five 2 x 2 tables
produced by the 5-fold cross validation procedure, and
calculate the relative risk using the combined 2 x 2
table.
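The local cross validation steps can be sketched as follows. This is an illustration under assumptions rather than the authors' implementation: the Gini index stands in for the impurity function (which this section does not specify), the covariate is treated as ordered with cut-offs of the form x <= cut, the left side of the split is taken as the low risk group, and all function names are hypothetical.

```python
import random

def gini(events, total):
    p = events / total
    return 2 * p * (1 - p)

def best_cutoff(rows):
    """Cut-off on x minimizing the weighted Gini impurity.

    rows is a list of (x, y) pairs with y in {0, 1}; x must take at
    least two distinct values.
    """
    n = len(rows)
    best = None
    for cut in sorted({x for x, _ in rows})[:-1]:
        left = [y for x, y in rows if x <= cut]
        right = [y for x, y in rows if x > cut]
        score = (len(left) * gini(sum(left), len(left))
                 + len(right) * gini(sum(right), len(right))) / n
        if best is None or score < best[0]:
            best = (score, cut)
    return best[1]

def two_by_two(rows, cut):
    """(a, b, c, d): events, non-events for x <= cut, then for x > cut."""
    a = sum(1 for x, y in rows if x <= cut and y == 1)
    b = sum(1 for x, y in rows if x <= cut and y == 0)
    c = sum(1 for x, y in rows if x > cut and y == 1)
    d = sum(1 for x, y in rows if x > cut and y == 0)
    return a, b, c, d

def local_cv_relative_risk(rows, v=5, seed=0):
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    folds = [rows[i::v] for i in range(v)]
    total = [0, 0, 0, 0]
    for i in range(v):
        # Select the split on the same variable without fold i ...
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        cut = best_cutoff(train)
        # ... then stratify the held-out fold by that split.
        table = two_by_two(folds[i], cut)
        total = [t + e for t, e in zip(total, table)]
    a, b, c, d = total
    # Relative risk from the combined table (right side vs left side).
    return (c / (c + d)) / (a / (a + b)), tuple(total)
```

Comparing the returned value with the resubstitution relative risk for the same variable gives the discrepancy that the text uses to flag spurious splits.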

Suppose that we apply the root node split in a tree to
an independent, ideally similar data set; then we would
have a 2 x 2 table, called $T$, with entries $(a_0, b_0, c_0, d_0)$.
Again, $a_0/b_0$ is the odds for the low risk group and $c_0/d_0$
is the odds for the high risk group. Now, every 2 x 2 table
obtained in the cross validation is an approximation to $T$
provided that the total sample size is taken into account,
despite the fact that different splits may be chosen.
The  combination  of  these  2x2  tables  is  a  way  of  aver¬ 
aging  and  generally  provides  better  approximation  to  T 
than  those  individual  tables.  Since  potentially  different 
splits  may  be  chosen,  the  combined  2x2  table  does  not 
have  an  intuitive  interpretation,  but  the  combination  is 
legitimate  from  a  statistical  point  of  view.  Here  is  the 
reason.  All  individual  2x2  tables  are  generated  by  the 
same  algorithm  and  the  variation  among  them  is  sole¬ 
ly  due  to  the  random  sampling  in  the  cross  validation. 
Therefore, these 2 x 2 tables can be viewed as i.i.d. random
4-vectors and equal to the originally selected split
in distribution. Therefore, mathematical operations on
these tables are well-defined. This is a key idea behind
the cross validation that is also used in Breiman et al.,
as elaborated in Remark 1.

When  the  selected  split  is  not  spurious  but  real,  the 
constituents of the low and high risk groups determined
in the cross validation should be similar although they
may be different, and hence the cross validation estimate
of relative risk should be close to the resubstitution
estimate. In contrast, if the selected split is spurious, the
split is hardly reproducible and the constituents of the
low and high risk groups from the cross validation can
be  very  different.  The  resubstitution  and  the  cross  val¬ 
idation  estimates  should  also  be  very  different.  In  this 
case,  neither  the  resubstitution  nor  the  cross  validation 
method  may  provide  an  accurate  estimate  for  the  rela¬ 
tive  risk.  What  is  important  is  that  the  spurious  split  is 
identified  and  the  precise  level  of  the  relative  risk  is  no 
longer  of  great  interest.  To  some  extent,  we  must  make 
a  subjective  call  on  the  basis  of  the  discrepancy  between 
the  two  types  of  estimate  —  the  same  dilemma  as  was 
seen  in  determining  a  spurious  split. 

It is also helpful to draw a line between an a priori
defined split, $s_0$, and a split, $s$, selected from the impurity
criteria using a learning sample $\mathcal{L}$. Here $s_0$ and $s$ are
conceptually different even if they are actually the same.
For example, before performing the tree-based analysis
we might have decided to calculate the relative risk of a
disease comparing black vs white. Then, $s_0$ is black vs
white. The relative risk can be obtained directly from
the data without any adjustment because there is no bias
due to the split selection. After the tree-based procedure
is applied to the data, a selected split $s$ may turn out
to be $s_0$. This time, there is a potential bias when the
relative risk for $s$ is calculated by resubstitution. We
cannot treat $s$ in the same way as $s_0$ even though they
look identical. Instead, this $s$ should be regarded as the
same as a split $s^*$ which may be obtained by the same
algorithm from a learning sample $\mathcal{L}^*$ that is the same as
$\mathcal{L}$ in distribution.

Remark  1.  This  cross  validation  procedure  is  in  fact 
inspired by the analogy with Breiman et al. (pp. 75-78).
The connection can be made as follows. As pointed
out  earlier,  we  attempt  to  evaluate  the  influence  of  each 
variable  selected  to  form  the  tree.  They  are  interested  in 
the  misclassification  rate  of  a  sub-tree  corresponding  to 
a specified complexity parameter. Therefore, the two procedures
are similar in the sense that they require something
fixed, i.e., a variable versus a complexity parameter.
In the step of using cross validation, the cut-off for
the  fixed  variable  may  vary  in  our  procedure  while  the 
structure  of  the  sub-tree  corresponding  to  the  specified 
complexity parameter changes, too, in that of Breiman
et al. Therefore, the two procedures are similar in the
sense that they allow something to vary during cross validation.
Finally, both procedures take an averaging step
over the results during cross validation. Breiman et al.
(p. 77) acknowledge that the cross validation estimates of
misclassification  rates  tend  to  be  conservative  in  the  di¬ 
rection  of  overestimating  misclassification  rates.  In  our 
situation,  we  would  then  expect  that  the  cross  valida¬ 
tion  estimate  of  relative  risk  may  be  biased  toward  the 
null  value.  Therefore,  it  is  useful  to  look  at  both  the 
resubstitution  and  the  cross  validation  estimates  of  rela¬ 
tive  risk  because  they  are  potentially  biased  in  opposite 
directions  and  presumably  the  resubstitution  estimates 
are  more  biased. 

2.4  Missing  Data 

In  most  applications  of  the  generalized  linear  model, 
users  take  naive  approaches  to  deal  with  missing  data 
such  as  deleting  subjects  which  have  missing  data  in  any 
covariates. As pointed out by Breiman et al. (1984,
Section  5.3.2),  this  strategy  may  result  in  a  loss  of  a 
substantial  portion  of  the  data.  The  nature  of  the  recur¬ 
sive  partitioning  procedure  makes  it  possible  to  handle 
missing  data  in  a  more  efficient  manner. 

Two notable approaches have been proposed in the literature
by Breiman et al. (1984, Section 5.3) and Clark
and Pregibon (1992). The former uses surrogate splits
to mimic the best splits. When a subject has a missing
value on a covariate by which the best split is defined,
a surrogate split, based on another covariate, would be
used to assign the subject. We call the latter the "missings
together" (MT) method. In other words, the cases with
missing  values  for  the  splitting  variable  are  assigned  to 
one  node. 

The  strategy  of  using  the  surrogate  splits  fits  perfect¬ 
ly  into  the  tree-based  method  and  is  a  most  thoughtful 
idea.  The  surrogate  splits  are  implemented  in  CART 
and  can  be  carried  over  without  user  involvement.  Nev¬ 
ertheless,  we  have  a  practical  concern  with  this  strategy, 
that  is,  when  a  reader  looks  at  a  published  tree,  it  is  not 
clear  how  a  case  with  missing  information  is  assigned 
unless  the  authors  give  all  (primary  and  secondary)  sur¬ 
rogate  splits  associated  with  a  tree.  This  information 
is  in  fact  available  in  the  original  CART  printout,  but 
unfortunately  only  a  limited  amount  of  information  may 
be  published.  In  classification  problems,  we  can  incorpo¬ 
rate  surrogate  splits  into  an  automated  procedure  with¬ 
out  worrying  about  what  they  are.  In  contrast,  for  the 
present  application,  we  must  know  the  surrogate  splits  in 
order  to  report  and  interpret  them.  It  could  be  tedious 
to  describe  all  possibilities  of  applying  the  primary  and 
secondary  splits. 

The  MT  strategy  provides  an  alternative,  simple  ap¬ 
proach  for  handling  missing  data  although  it  may  not  use 
the  data  as  efficiently  as  the  surrogate  splits.  Now,  we 
describe the implementation of the MT strategy. Suppose
that $x_i$ is a nominal covariate taking two distinct
levels $a$ and $b$ (the idea extends immediately to more
levels). The candidate splits accommodating missing
values are {NA} vs. {a, b}, {NA, a} vs. {b}, and {NA, b} vs. {a},
where NA stands for missing values. The idea is to treat NA as an extra
level of $x_i$. If $x_i$ is ordinal, Clark and Pregibon suggested
the use of the same strategy by quantifying $x_i$ first
and then treating it as if it were nominal. Suppose that
$x_i = (1, 2, 3, 4, 5, \mathrm{NA})'$ is an ordinal predictor; $x_i$ may
first be converted to, say, $\tilde{x}_i = (a, a, a, b, b, \mathrm{NA})'$, in which
$a$ covers 1, 2, 3 and $b$ covers 4 and 5. Then, $\tilde{x}_i$ would
replace $x_i$ in partitioning.
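Treating NA as an extra level of a nominal covariate amounts to enumerating the binary partitions of the enlarged level set. The sketch below is illustrative only; `candidate_splits` is a hypothetical helper, and NA is pinned to the left side so that each split is listed once.

```python
from itertools import combinations

def candidate_splits(levels, missing="NA"):
    """Binary splits of a nominal covariate with NA as an extra level."""
    rest = list(levels)
    splits = []
    # NA always sits on the left; k observed levels join it there.
    # k stops short of len(rest) so the right side is never empty.
    for k in range(len(rest)):
        for extra in combinations(rest, k):
            left = (missing,) + extra
            right = tuple(v for v in rest if v not in extra)
            splits.append((left, right))
    return splits

splits = candidate_splits(["a", "b"])
```

For the two levels a and b this returns the three splits named in the text: NA alone against {a, b}, {NA, a} against {b}, and {NA, b} against {a}.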

Clark and Pregibon's implementation of the MT
strategy has two limitations. First, the quantification ignores
the original order of $x_i$. From the example above,
the natural order in {1, 2, 3} versus {4, 5} vanishes. However,
the order of $x_i$ is very important for interpreting
the results. Second, the quantification often uses artificially
coarser measurements of the covariates than the
original values of the covariates, which presumably results
in coarser splits. For instance, 2 in the example
above can never be a cut-off value for $x_i$. The round-off
effect may be minor, but it is unnecessary.

We propose a new implementation for the MT strategy
by replacing the original $x_i$ with two new variables.
Let $x_i = (x_{i1}, \ldots, x_{iN})$. Define the components of $x_i^{(1)}$
and $x_i^{(2)}$ to be the same as those of $x_i$ when the components
of $x_i$ are not missing. For all missing components
of $x_i$, the corresponding values of $x_i^{(1)}$ and $x_i^{(2)}$ are
respectively defined as $\min_j(x_{ij}) - 1$ and $\max_j(x_{ij}) + 1$.
The idea is to regard the missing value as an additional
distinct value of $x_i$. The assigned values per se are not
important and should be viewed as a generic labeling for
missing values. What is important is that the labeling
ensures that all subjects having missing data will be sent
to one (left or right) side of a split. For example,
taking $N = 6$, let $x_i = (2.1, -4.0, \mathrm{NA}, 1.5, 7.3, \mathrm{NA})$; then
$x_i^{(1)} = (2.1, -4.0, -5.0, 1.5, 7.3, -5.0)$ and
$x_i^{(2)} = (2.1, -4.0, 8.3, 1.5, 7.3, 8.3)$.
The two copies introduced for the ordinal
variable with missing values compete independently
with other covariates while, obviously, only one of them
may be selected at each node. For example, if $x_i^{(1)}$ is
chosen as a splitting variable, it sends all subjects whose
values are missing for this variable to the left daughter
node. It is worthwhile to note that both $x_i^{(1)}$ and $x_i^{(2)}$
may be used again to split lower nodes, thus allowing the
cases with missing values to go to either side of the node.
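The two-copy construction is easy to state in code. The sketch below is a minimal illustration; `mt_copies` is a hypothetical name and missing values are represented by `None`, which is an assumption about the data encoding.

```python
def mt_copies(x):
    """Return the two copies of x with missing values (None) relabeled.

    The first copy codes every missing value as min(x) - 1 and the
    second as max(x) + 1, with min and max taken over observed values.
    """
    observed = [v for v in x if v is not None]
    low, high = min(observed) - 1, max(observed) + 1
    x1 = [low if v is None else v for v in x]
    x2 = [high if v is None else v for v in x]
    return x1, x2

# The example from the text, with N = 6.
x1, x2 = mt_copies([2.1, -4.0, None, 1.5, 7.3, None])
```

Any cut-off on the first copy places the relabeled cases on the left (they carry the smallest value), and any cut-off on the second copy places them on the right, which is why an ordinary ordered-split search over both copies covers both assignments.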

Remark 2. We create two copies of one variable on the
basis of the variable, not individual subjects. Suppose
that $x_1$ and $x_2$ are two variables. If $x_1$ has missing values
in any of its components, two copies, $x_1^{(1)}$ and $x_1^{(2)}$, will
be created. Similarly, if $x_2$ has missing values in any
of its components, two copies, $x_2^{(1)}$ and $x_2^{(2)}$, will also be
created. However, if subjects 1 and 2 have missing values
in both $x_1$ and $x_2$, we do not create four copies of $x_1$ and
$x_2$ for subject 1 and another four copies for subject 2.
We use the same copies to cover both subjects.

3  Discussion 

In  this  paper,  we  have  advocated  the  use  of  a  tree-based 
method for epidemiologic studies. This nonparametric
method  is  particularly  convenient  and  appropriate  when 
the  objective  is  to  identify  risk  factors  associated  with  a 
certain  event,  to  discover  interactions  among  the  factors, 
and  to  find  high  risk  subpopulations.  When  applying 
CART  to  epidemiologic  studies,  some  modifications  are 
necessary.  We  have  designed  a  more  user-friendly  pro¬ 
gram  that  provides  users  with  options  to  control  the  tree 
structures.  Prior  to  using  the  existing  CART  technology, 
there  were  two  fundamental  decisions  that  users  would 
have to make: the prior probability of the outcome and
the cost of misclassification. Since our data come from a
prospective  study,  it  is  reasonable  to  estimate  the  prior 
probability  from  the  data.  Therefore,  the  prior  selection 
is  not  a  problem  for  us.  However,  it  has  been  observed  in 
the  literature  that  the  final  tree  structure  is  sensitive  to 
the  choice  of  misclassification  costs  [e.g.,  Breiman  et  al. 
(pp.  175-181,  1984)].  With  the  low  prevalence  rate  of 
the  outcome  in  the  present  application,  it  is  even  more 
difficult  and  subtle  to  specify  misclassification  costs  and 
then  to  justify  these  choices.  If  a  user  changes  the  tree 
structure for various reasons, it violates the basic condition
for using the procedure of Breiman et al. (1984, Section
3.4.2). When this occurs, we suggest the use of an
alternative pruning procedure.

Abstract - We study the problem of detection
and size measurement of discontinuities from
sampled noisy data of a univariate function. The
proposed  tests  and  estimators  are  of  the  form  of 
linear  convolution  filters  with  characteristics 
employing  orthogonal  series  expansions.  In 
particular,  the  standard  and  conjugate  Fourier 
series are taken into consideration. The
convergence  properties  (consistency  and  rate  of 
convergence)  of  the  proposed  estimates  are 
established. 


I.  INTRODUCTION 

Consider a univariate regression model

$$y_i = f(x_i) + z(x_i), \quad i = 1, 2, \ldots, n, \tag{1}$$

where $x_1, x_2, \ldots, x_n$ are fixed-design points in
$[-\pi, \pi]$, say, $z(x_1), z(x_2), \ldots, z(x_n)$ are uncorrelated
random errors with zero mean and finite variance
$\sigma^2$, and $f$ is the unknown regression function.

The detection of singular points
(discontinuities, corner points, etc.) in an otherwise
smooth function is an important problem in a
number of areas, e.g., system theory,
signal/image processing and statistics [1], [5], [7],
[8]. In the area of image processing a large variety
of different operators for locating changes in
image intensities have been suggested.
Traditionally,  the  proposed  techniques  have  been 
introduced  ad  hoc  and  their  performance  has  been 
justified  by  simulation  studies  over  selected 
images. 

More recently, however, optimal detection filters,
obtained by optimizing a criterion
that combines the signal-to-noise ratio, the
localization measure, and resolution (quantified
by the number of false responses), have been
proposed [2], [3], [4], [5], [6]. The local maxima
in  the  thresholded  output  of  such  filters  have  been 
used  as  an  estimate  of  the  discontinuity  position. 
No  rigorous  statistical  analysis  of  the  proposed 
techniques  has  been  carried  out. 

All the above works concern finding the location
of the edge; they do not, however, estimate the
edge size. The latter, clearly, can be a useful
component in the image reconstruction and
understanding processes. The problem of
measuring the size of discontinuities in a
function was first (in the image processing
literature) studied in [5], [6]. In [8], [9] kernel
type nonparametric regression techniques for
estimating the locations of jump points and the
corresponding sizes of jump values have been
proposed.

In  this  paper  we  propose  a  class  of  linear 
filters  which  are  able  to  localize  discontinuities  of 
a function of virtually any form, i.e., the behavior
of the function in the neighborhood of the
discontinuity need not be in the form of a step,
ramp, or a polynomial of finite order. Thus,
we can cope with a nonparametric class of
discontinuous functions. The proposed techniques
give consistent estimates of the discontinuity size;
see [10] for a complete account of our techniques.
Our  approach  stems  from  the  theory  of  Fourier 
series,  and  specifically,  from  results  that  are  related 
to the so called Gibbs phenomenon [11]. The
problem of measuring the discontinuity size
was first addressed as early as 1913 by Fejér [12],
see also [11]. There, it was elaborated in the
context  of  convergence  of  the  partial  sum  of 
Fourier  series  in  the  neighborhood  of  a 
discontinuity  of  the  function  being  expanded.  Here 
we  utilize  this  approach  in  the  case  when  sampled 
noisy  data  generated  by  the  regression  model  (1) 
are  available.  We  observe  also  that  the  proposed 
techniques are of the form of linear filters with
odd impulse response functions having multiple
zeros. We prove that the filter responses at a
given point converge to the discontinuity size at
this point. This reveals how the error depends
on  the  distribution  of  discontinuities  (our 
techniques  allow  multiple  discontinuities),  the 
function  smoothness  away  from  discontinuity, 
noise  characteristics,  sampling  rate,  and  the  filter 
bandwidth.  As  a  result,  an  optimal  value  of  the 
filter  bandwidth  is  obtained.  The  problem  of  the 
discontinuity  localization  is  also  examined. 

II.  DISCONTINUITY  MEASUREMENT 
AND  LOCALIZATION 

Let $f(x)$ be a real, integrable function defined,
without loss of generality, over $[-\pi, \pi]$. Let
$\Delta(x) = f(x+) - f(x-)$ be the discontinuity size of $f$ at
the point $x$.

Our aim is to estimate $\Delta(x)$ using convolution
operators of the following form

$$\int_{-\pi}^{\pi} f(t) K_q(x-t)\,dt, \tag{2}$$

where the filter characteristic $K_q(x)$ of order $q$
satisfies two properties: (1) it is an odd function,
(2) it has $2q+1$ zeros in $[-\pi, \pi]$, including 0 and
$\pm\pi$. Furthermore, the operator should have the
property that $\int_{-\pi}^{\pi} f(t) K_q(x-t)\,dt \to \Delta(x)$ as $q \to \infty$
for as general a class of discontinuous
functions as possible.

Filters of this form, in the context of edge
detection, have been studied in the computer vision
literature [1], [2], [3], [4], [5], [6]. Typically,
however, the value $q = 0$, i.e., only one zero-crossing
at $x = 0$, has been assumed, and they do not
estimate the size of the discontinuity.

In this paper we propose (other alternatives
are also possible) the following two prescriptions
for $K_q(x)$:

$$\hat{K}_q(x) = -\frac{1}{q} \sum_{j=1}^{q} j \sin jx, \tag{3}$$

$$\tilde{K}_q(x) = -\frac{1}{\ln q} \sum_{j=1}^{q} \sin jx. \tag{4}$$

Both techniques originate in the theory
of Fourier series, see [10], [11], [12]. The kernel
in (3) results from a simple integration by parts
and the observation that $\hat{K}_q(x) = q^{-1} \frac{d}{dx} D_q(x)$, where
$D_q(x)$ is the Dirichlet kernel of order $q$. The kernel
$\tilde{K}_q(x)$ is related to the theory of conjugate Fourier
series [11], i.e., $\tilde{K}_q(x) = -\frac{1}{\ln q} \tilde{D}_q(x)$, where
$\tilde{D}_q(x)$ is the conjugate Dirichlet kernel of order $q$.
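A worked special case may help fix the sign conventions. The following is a sketch under assumptions: it writes $\hat{K}_q(x) = -q^{-1}\sum_{j=1}^{q} j \sin jx$ for the kernel in (3) and $\tilde{K}_q(x) = -(\ln q)^{-1}\sum_{j=1}^{q} \sin jx$ for the kernel in (4), and takes the unit step $f(t) = 1$ for $0 \le t < \pi$, $f(t) = 0$ for $-\pi \le t < 0$, so that $\Delta(0) = 1$. Evaluating the operator (2) at $x = 0$:

```latex
\begin{aligned}
\int_{-\pi}^{\pi} f(t)\,\hat{K}_q(0-t)\,dt
  &= \frac{1}{q}\sum_{j=1}^{q} j \int_{0}^{\pi} \sin jt\,dt
   = \frac{1}{q}\sum_{j=1}^{q}\bigl(1-(-1)^{j}\bigr)
   = \frac{2}{q}\left\lceil \frac{q}{2} \right\rceil, \\[4pt]
\int_{-\pi}^{\pi} f(t)\,\tilde{K}_q(0-t)\,dt
  &= \frac{1}{\ln q}\sum_{j=1}^{q} \frac{1-(-1)^{j}}{j}
   = \frac{2}{\ln q}\sum_{\substack{j \le q \\ j\ \mathrm{odd}}} \frac{1}{j}
   = 1 + O\!\left(\frac{1}{\ln q}\right).
\end{aligned}
```

The first expression equals 1 exactly for even $q$, while the second approaches 1 only at a logarithmic rate, because the sum of reciprocals of the odd integers up to $q$ grows like $\tfrac{1}{2}\ln q$ plus a constant.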

Since only the discrete and noisy data (1) are
available, one has to replace the integral in (2) by
some discrete approximation. Combining this
with the definitions of $\hat{K}_q(x)$ and $\tilde{K}_q(x)$ we can
define the following estimates of $\Delta(x)$:

$$\hat{\Delta}(x) = \sum_{j=1}^{n} y_j (x_{j+1} - x_j) \hat{K}_q(x - x_j), \tag{5}$$

$$\tilde{\Delta}(x) = \sum_{j=1}^{n} y_j (x_{j+1} - x_j) \tilde{K}_q(x - x_j). \tag{6}$$
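The two estimates can be computed directly from their definitions. The sketch below is illustrative, not the authors' code: it assumes the kernels $-q^{-1}\sum_j j \sin jx$ for (3) and $-(\ln q)^{-1}\sum_j \sin jx$ for (4), takes $x_{n+1} = \pi$ for the last spacing (the text does not define it), and uses hypothetical names `k_hat`, `k_tilde` and `jump_estimate`.

```python
import math

def k_hat(u, q):
    """Kernel (3): -(1/q) * sum_{j=1}^{q} j sin(j u)."""
    return -sum(j * math.sin(j * u) for j in range(1, q + 1)) / q

def k_tilde(u, q):
    """Kernel (4): -(1/ln q) * sum_{j=1}^{q} sin(j u)."""
    return -sum(math.sin(j * u) for j in range(1, q + 1)) / math.log(q)

def jump_estimate(x, xs, ys, q, kernel):
    """Discrete convolution (5) or (6) evaluated at the point x.

    xs must be sorted design points in [-pi, pi); the last spacing
    uses x_{n+1} = pi, which is an assumption.
    """
    ends = xs[1:] + [math.pi]
    return sum(y * (e - s) * kernel(x - s, q)
               for s, e, y in zip(xs, ends, ys))

# Noise-free unit step at 0 on an equally spaced design.
n = 2000
xs = [-math.pi + 2 * math.pi * i / n for i in range(n)]
ys = [1.0 if t >= 0 else 0.0 for t in xs]
d_hat = jump_estimate(0.0, xs, ys, 10, k_hat)
d_tilde = jump_estimate(0.0, xs, ys, 10, k_tilde)
```

On this noise-free step with $q = 10$, `d_hat` comes out close to the true jump of 1, whereas `d_tilde` is about 1.55, which illustrates the slow logarithmic convergence of the conjugate-kernel estimate.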

Our results model the performance of
discontinuity estimates on grids which become
increasingly fine, i.e., as $\delta_n \to 0$, where
$\delta_n = \max_j (x_{j+1} - x_j)$. As a measure of discrepancy
between the estimates $\hat{\Delta}(x)$, $\tilde{\Delta}(x)$ and $\Delta(x)$ we
choose the mean square error $E(\hat{\Delta}(x) - \Delta(x))^2$.
The behavior of both estimates is described in the
following theorem.

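Theorem 1. Let $f$ be a function of bounded
variation. Then

$$E(\hat{\Delta}(x) - \Delta(x))^2 \le \pi\sigma^2 \delta_n q + V^2(f)(q\delta_n)^2 + \left(\int_{-\pi}^{\pi} f(t)\hat{K}_q(x-t)\,dt - \Delta(x)\right)^2, \tag{7}$$

$$E(\tilde{\Delta}(x) - \Delta(x))^2 \le \frac{\pi\sigma^2 \delta_n q}{\ln^2 q} + V^2(f)\left(\frac{q\delta_n}{\ln q}\right)^2 + \left(\int_{-\pi}^{\pi} f(t)\tilde{K}_q(x-t)\,dt - \Delta(x)\right)^2, \tag{8}$$

where $V(f)$ is the total variation of $f$.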

The first terms in (7) and (8) are caused by
the presence of noise, while the second ones
represent the bias due to the discreteness of the data.
The third terms are due to the finite $q$ used in the
definition of the filter characteristics, see (3) and
(4). Theorem 1 shows that $\operatorname{var} \tilde{\Delta}(x)$ is smaller
than $\operatorname{var} \hat{\Delta}(x)$. As for the behavior of the last
term in (7) we can show, see [10] for details, that
if $f(x)$ has right-hand and left-hand derivatives
for all $x \in (-\pi, \pi)$ then it is of order $O(1/q^2)$.
Regarding the filter $\tilde{K}_q$ we show that the last term
in (8) is of order $O(1/\ln^2 q)$ assuming that $f(x)$ is
of bounded variation. Thus, the filter $\tilde{K}_q$ requires
weaker assumptions than $\hat{K}_q$ in order to extract
the discontinuity size. On the other hand, $\hat{\Delta}(x)$
has much smaller bias than $\tilde{\Delta}(x)$.

It  is  apparent  that  the  first  two  terms  in  (7)  and 
(8)  are  increasing  as  q  becomes  larger.  This 
manifests  a  trade-off  between  random  (quantified 
by  the  variance)  and  systematic  (bias)  errors.  That 
is,  to  eliminate  a  systematic  error  one  should  use 
a  large  value  of  q,  whereas  a  small  value  of  q  will 
reduce  the  random  variation  and  discretization 
error. 

It is evident from (7) and (8) that in order to
reduce the error one has to relate $q$ to $\delta_n$.
Hence, let $q = c\,\delta_n^{-\alpha}$, $c, \alpha > 0$. Clearly, if $\alpha > 1$
then the error tends to infinity, while for $0 < \alpha < 1$
the error goes to zero as $\delta_n \to 0$. Direct
minimization of (7) and (8) with respect to $q$ implies
that the optimal $q$ is of order $\delta_n^{-1/3}$ and $\delta_n^{-1}/\ln \delta_n^{-1}$,
respectively. Furthermore, for $\delta_n \approx c/n$, $c > 0$, the
corresponding errors are $n^{-2/3}$ and $1/\ln^2 n$. Hence,
the estimate $\hat{\Delta}(x)$ tends to $\Delta(x)$ much faster than
$\tilde{\Delta}(x)$. It is worth noting that the kernel estimate
proposed in [8], [9] can reach the rate $O(n^{-2/3})$
provided that $f(x) = v(x) + \Delta\, I_{[\theta, \pi]}(x)$, where
$v(x)$ is Lipschitz continuous.
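The exponent for the first filter can be recovered by balancing orders. The following sketch assumes the three terms of (7) are of order $\delta_n q$, $(q\delta_n)^2$ and $1/q^2$ (the last under the one-sided derivative condition stated above) and substitutes $q = c\,\delta_n^{-\alpha}$:

```latex
\begin{aligned}
\delta_n q &\asymp \delta_n^{\,1-\alpha}, &
(q\delta_n)^2 &\asymp \delta_n^{\,2-2\alpha}, &
q^{-2} &\asymp \delta_n^{\,2\alpha}, \\[4pt]
1-\alpha &= 2\alpha
\;\Longrightarrow\; \alpha = \tfrac{1}{3}, &
\text{error} &\asymp \delta_n^{\,2/3}, &
\delta_n \approx \tfrac{c}{n}
&\;\Longrightarrow\; \text{error} \asymp n^{-2/3}.
\end{aligned}
```

For $0 < \alpha < 1$ the discretization term $\delta_n^{2-2\alpha}$ is of smaller order than the noise term $\delta_n^{1-\alpha}$, so only the noise term and the finite-$q$ term need to be balanced.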

Although $\tilde{\Delta}(x)$ is a slower estimate of $\Delta(x)$, it can
have better localization properties than $\hat{\Delta}(x)$. In
fact, let $\theta$ be a point where the discontinuity in
$f(x)$ takes place. Assume, without loss of
generality, that there is a single discontinuity and
that $\Delta(\theta) > 0$. Then, clearly, $\hat{\theta} = \arg\max_x \hat{\Delta}(x)$
and $\tilde{\theta} = \arg\max_x \tilde{\Delta}(x)$ can define estimates of
$\theta$.

It can be shown [10] that $E(\hat{\theta} - \theta)^2 = O(n^{-4/5})$
while $E(\tilde{\theta} - \theta)^2 = O(n^{-4/5} \ln^{-6/5}(n))$. Thus, the
discontinuity localization detector $\tilde{\theta}$ based on $\tilde{\Delta}(x)$
can outperform the one which uses $\hat{\Delta}(x)$.

This is a very surprising result since $\tilde{\Delta}(x)$ tends
to $\Delta(\theta)$ more slowly than $\hat{\Delta}(x)$.

Furthermore, one can recover $\theta$ faster than $\Delta(\theta)$.
Thus, one can conclude that the problem of edge
localization is "easier" than the problem of edge
measurement.

All the above considerations imply that a
combination of both techniques can be an attractive
alternative, i.e., first apply $\tilde{\Delta}(x)$ to detect $\theta$ and
then use $\hat{\Delta}(\tilde{\theta})$ as an estimate of $\Delta(\theta)$.

To illustrate the aforementioned results, let us consider the piecewise constant function

$$f(x) = \begin{cases} 0 & \text{if } x < 0, \\ A_1 & \text{if } 0 \le x < 1, \\ A_1 + A_2 & \text{if } 1 \le x. \end{cases}$$


Figure 1 shows A(x) and A(x) for q = 10 and two different combinations of $A_1$ and $A_2$. Figure 2, on the other hand, plots A(x) and A(x) locally in the neighborhood of x = 1; here q = 6. The noise variance is 0.01 and n = 128. It is clear that the A(x) method reaches its maximum at the wrong location, i.e., x = 1.15, whereas the A(x) estimate perfectly localizes the discontinuity at x = 1.

Nevertheless, the value A(1) is much greater than the size of the discontinuity A(1) = 1.


Figure 1. A(x) and A(x), q = 10, (a) $A_1 = A_2 = 0.5$, (b) $A_1 = 0.5$, $A_2 = -0.5$

Figure 2. A(x) (in gray) and A(x) (in black) in the neighborhood of x = 1; $A_1 = A_2 = 1$

Abstract

Change-point models have attracted attention in a variety of fields, and there are many approaches to inference, both parametric and nonparametric. This paper discusses some asymptotic results for change-point inference in the context of nonparametric regression. In one dimension, a change-point can be defined as a point with a discontinuity in one or more derivatives of the response function. A method for fitting change-points with semiparametric models is discussed which can be used with arbitrary linear smoothers. Techniques are given to identify the number and location of change-points, to estimate the size of the jump discontinuities, and to fit the entire response function with discontinuities.

Keywords:  change-point,  nonparametric  regression, 
semiparametric  model 


1.  Introduction 

This article reports on progress in adapting fairly general nonparametric regression smoothers to estimating curves with features such as jumps and cusps at known or unknown locations. The key idea is to use parametric models for the features (e.g. jumps or cusps) and to correspondingly modify an otherwise smooth fit to incorporate these features.

There is a very large literature on statistical methods for “change-point” problems (e.g. see Siegmund, 1986). A prototype for such problems is that of detecting a possible shift in a normal mean in independent observations over time. Assuming independent observations $y_1, \ldots, y_n$, a change occurs at time $\tau$, $1 \le \tau < n$, if

$$y_i \sim \begin{cases} N(\mu_1, \sigma^2), & i \le \tau, \\ N(\mu_2, \sigma^2), & i > \tau, \end{cases} \qquad (1.1)$$

where $\mu_1 \ne \mu_2$. The classical problems are to (i) determine if a change has occurred and (ii) if so, estimate when this happened.

A natural generalization is to relax the assumption of piecewise constant mean and to consider models in which the mean varies smoothly except for one or more isolated change-points. As an example, consider the regression model

$$y_i = \mu(t_i) + \varepsilon_i, \quad i = 1, \ldots, n,$$

where the $\varepsilon_i$ are i.i.d. errors with $E(\varepsilon_i) = 0$, $E(\varepsilon_i^2) = \sigma^2 < \infty$. For simplicity, we will take $t_i = i/n$. A nonparametric regression version of the change-point problem is

$$\mu(t_i) = \begin{cases} f(t_i), & t_i \le \tau, \\ g(t_i), & t_i > \tau, \end{cases}$$

for some $\tau \in [0, 1]$, where $f$ is a smooth function on $[0, \tau]$, $g$ is a smooth function on $[\tau, 1]$, and $f(\tau) \ne g(\tau)$. As in (1.1), $\tau$ will be called a “change-point” in this setting as well. The situation where $\mu$ is continuous but has a jump discontinuity in the first derivative at some point $\tau$ can be described similarly.

There is a growing body of literature on the nonparametric regression version of the change-point problem. A parametric version in the spirit of the problem here was given by McDonald and Owen (1986), and Hall and Titterington (1992) proposed methods specific to estimating curves with peaks and edges. Müller (1992) and Wu and Chu (1993), among others, have proposed methods based on differences of nonparametric kernel estimates. Loader (1993) has recently treated change-point problems using local polynomial estimators. There is also a vast related literature in image processing devoted to edge detection (see e.g. Tagare and deFigueiredo, 1990).

The methods discussed here are applications of a general class of semiparametric models. In principle, they solve a variety of problems, and they have the advantage of not requiring specialized smoothers. Instead, the methods modify arbitrary smoothers to allow estimation and preservation of features like jumps and cusps.


2.  Semiparametric  change-point  models 

Semiparametric models for this setting go back at least as far as Wahba (1984) and Engle, Granger, Rice and Weiss (1986) for spline smoothing. Eubank and Speckman (1994) and Cline, Eubank and Speckman (1993) have recently extended these models for arbitrary linear smoothers. Suppose

$$\mu(t_i) = \beta\phi(t_i) + f(t_i),$$

where

$$\phi(t) = \phi_k(t - \tau)$$

with

$$\phi_k(t) = \begin{cases} t^{k-1}/(k-1)!, & t > 0, \\ 0, & t \le 0, \end{cases}$$

for $k \ge 1$. Here $f$ is assumed smoother than $\phi_k$ (i.e. $f^{(k-1)}$ is continuous at $\tau$), and hence the parameter $\beta$ is identified as the size of a jump discontinuity in $\mu^{(k-1)}$ at $\tau$.

This simple model generalizes immediately to models with discontinuities in more than one derivative at $\tau$, e.g.

$$\mu(t) = \beta_1\phi_1(t - \tau) + \beta_2\phi_2(t - \tau) + f(t),$$

or to models with multiple change-points $(\tau_1, \ldots, \tau_r)$ such as

$$\mu(t) = \sum_{j=1}^{r} \sum_{k=1}^{s_j} \beta_{jk}\phi_k(t - \tau_j) + f(t).$$

Note that the latter model has $p = s_1 + \cdots + s_r$ parameters.

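We will adopt vector notation for the nonparametric regression model, letting

$$y = (y_1, \ldots, y_n)', \quad \mu = (\mu(t_1), \ldots, \mu(t_n))', \quad \varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)',$$

and write $y = \mu + \varepsilon$. Suppose that a linear smoother is used to estimate $\mu$. We will denote the result of smoothing by

$$\hat\mu = (\hat\mu(t_1), \ldots, \hat\mu(t_n))' = Sy$$

for a suitable $n \times n$ matrix $S$.

Assuming known change-points, a semiparametric model with $p$ parameters can be written in terms of an $n \times p$ matrix, for example

$$X = \bigl[\phi_1(t_i - \tau_1) \;\cdots\; \phi_{s_r}(t_i - \tau_r)\bigr]_{n \times p}$$

with $\beta = (\beta_1, \ldots, \beta_p)'$, to obtain

$$y = f + X\beta + \varepsilon.$$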

2.1.  Estimation  with  known  change-points 

A general method (independently derived by Denby (1986), Speckman (1988), and Robinson (1988)) for estimating $\beta$ with good properties in this setting is to minimize

$$\|(I - S)(y - X\beta)\|^2$$

over $\beta$. The solution can be expressed as $\hat\beta = (\tilde X'\tilde X)^{-1}\tilde X'(I - S)y$, where $\tilde X = (I - S)X$.

To motivate this estimator, note that

$$(I - S)y = (I - S)f + (I - S)X\beta + (I - S)\varepsilon.$$

If $f$ is a smooth function and $S$ is a smoother matrix suited to the smoothness class of $f$, then $(I - S)f$ is negligible in comparison with $(I - S)X$, so the regression of $(I - S)y$ on $(I - S)X$ produces an approximately unbiased estimate of $\beta$. Letting $\hat f = S(y - X\hat\beta)$, the entire function can be estimated as

$$\hat\mu = \hat f + X\hat\beta = Sy + (I - S)X\hat\beta. \qquad (2.1)$$
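The estimator (2.1) can be sketched numerically as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the smoother is a row-normalized Epanechnikov kernel smoother (any linear smoother matrix $S$ could be substituted), the data are synthetic, and the helper names `smoother_matrix`, `phi` and `semiparametric_fit` are hypothetical.

```python
import numpy as np
from math import factorial


def smoother_matrix(t, h):
    """n x n matrix S of a row-normalized Epanechnikov kernel smoother (an assumed choice)."""
    u = (t[:, None] - t[None, :]) / h
    K = np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)
    return K / K.sum(axis=1, keepdims=True)


def phi(t, tau, k=1):
    """phi_k(t - tau): (t - tau)^(k-1) / (k-1)! for t > tau, and 0 otherwise."""
    d = t - tau
    return np.where(d > 0, d ** (k - 1), 0.0) / factorial(k - 1)


def semiparametric_fit(y, S, X):
    """Return (beta_hat, mu_hat) with mu_hat = S y + (I - S) X beta_hat, as in (2.1)."""
    R = np.eye(len(y)) - S                      # I - S
    X_tilde = R @ X                             # (I - S) X
    # Least squares regression of (I - S) y on (I - S) X.
    beta_hat, *_ = np.linalg.lstsq(X_tilde, R @ y, rcond=None)
    mu_hat = S @ y + X_tilde @ beta_hat
    return beta_hat, mu_hat


# Synthetic data (an assumption): smooth curve plus a jump of size 2 at tau = 0.5025.
rng = np.random.default_rng(0)
n, h, tau = 200, 0.1, 0.5025
t = np.arange(1, n + 1) / n
y = np.sin(2 * np.pi * t) + 2.0 * phi(t, tau) + rng.normal(0.0, 0.1, n)

S = smoother_matrix(t, h)
X = phi(t, tau, k=1)[:, None]                   # n x 1 design matrix for one jump
beta_hat, mu_hat = semiparametric_fit(y, S, X)
print("estimated jump size:", beta_hat[0])
```

Because the fit only needs the matrix $S$, swapping in a different linear smoother changes one function and nothing else.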

Eubank and Speckman (1991) showed that

$$\frac{1}{n}E\|\hat\mu - \mu\|^2 \le \frac{1}{n}\|(I - S)f\|^2 + \frac{\sigma^2}{n}\operatorname{tr} S'S + O(n^{-1}).$$

Since the first two terms on the right give the average mean square error for estimating $f$ with smoother $S$, if $S$ is any nonparametric estimator, the convergence rate is slower than $O(1/n)$, so the last term is asymptotically negligible. Thus $\hat\mu$ has the same global convergence properties as $\hat f = Sy$ does when $\mu$ has the usual smoothness assumptions. This result is independent of choice of smoother.

2.2.  Example:  penny  data 

The first example concerns the penny data given in Scott (1992) and displayed in Figure 1. The data set consists of measurements in mils of the thickness of a sample of 90 U.S. Lincoln pennies, two per year, from 1945 through 1989. Penny thickness was reduced in World War II, restored to its original thickness sometime around 1960, and reduced again in the ’70s. Superimposed on the plot in Fig. 1(a) is a kernel smooth with bandwidth $h = 7$. Fig. 1(b) shows the fit from (2.1) with change-points for the years 1958 and 1974. With $\tau_1 = 58.5$ and $\tau_2 = 74.5$, the model

$$\mu(t) = \beta_1\phi_1(t - 58.5) + \beta_2\phi_1(t - 74.5) + f(t)$$

was estimated with a Gasser-Müller smoother. (Details on the choice of $\tau_1$ and $\tau_2$ are given below.)

Figure 1: (a) Penny thickness data (thickness in mils) with kernel smooth, bandwidth 7. (b) Fit with change-points 58.5 and 74.5.

3. Change-point detection and estimation

In practice, the location and even number of change-points may be unknown. The strategy implemented here has three steps. First, a detection scheme is used to determine how many change-points (if any) are present. Second, the exact location of these change-points is estimated. Finally, the entire function is fit with the estimated change-points using (2.1).

3.1.  Detecting  one  or  more  change-points 

The semiparametric model can be used to detect change-points as follows. Suppose one is searching for a sudden change in the $(k-1)$st derivative of $\mu$. At each of a possibly large number of candidate points $\tau$, the model

$$\mu(t) = \beta\phi_k(t - \tau) + f(t)$$

is fit. Denote the parameter estimate as $\hat\beta(\tau)$. Clearly, if $\mu^{(k-1)}$ is continuous at $\tau$, then $\hat\beta(\tau)$ should be near 0. But if $\tau$ is a change-point for $\mu^{(k-1)}$, then $\hat\beta(\tau)$ is an estimate of $\beta \ne 0$. To calibrate, let

$$Z(\tau) = \frac{\hat\beta(\tau)}{\sqrt{\operatorname{Var}(\hat\beta(\tau))}}.$$

The problem is to determine a critical value $c$ such that $|Z(\tau)| > c$ denotes a change-point in the vicinity of $\tau$ while controlling for false signals.

The behavior of the process $Z(\tau)$, $h \le \tau \le 1 - h$, is detailed in Speckman (1993) under the following assumptions. Assume equally spaced points in $[0, 1]$ with $t_i = i/n$, $i = 1, \ldots, n$, and further assume i.i.d. errors satisfying $E(\varepsilon_i) = 0$, $E(\varepsilon_i^2) = \sigma^2 < \infty$. To be specific, the smoother $S$ is taken to be defined by

$$\hat\mu(t) = \frac{1}{nh}\sum_{i=1}^{n} K\!\left(\frac{t - t_i}{h}\right) y_i,$$

where $h$ is the bandwidth and $K$ is a continuous, symmetric function with compact support $[-1, 1]$. We assume further that $K$ possesses the $m$th order smoothing properties $\int K(u)\,du = 1$, $\int u^r K(u)\,du = 0$, $r = 1, \ldots, m - 1$, and $\int u^m K(u)\,du \ne 0$. (For simplicity, boundary effects are ignored, so attention is restricted to $\mu(t)$, $h \le t \le 1 - h$.)

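In this setting, let

$$\phi_{k\tau} = (\phi_k(t_1 - \tau), \ldots, \phi_k(t_n - \tau))', \qquad (3.1)$$

and for fixed $\tau$ let $(I - S)'(I - S)\phi_{k\tau} = (w_1(\tau), \ldots, w_n(\tau))'$. Then

$$\hat\beta(\tau) = \frac{y'(I - S)'(I - S)\phi_{k\tau}}{\phi_{k\tau}'(I - S)'(I - S)\phi_{k\tau}} = \frac{\sum_{i=1}^{n} w_i(\tau)\,y_i}{\phi_{k\tau}'(I - S)'(I - S)\phi_{k\tau}}.$$

It follows that

$$Z(\tau) = \frac{\hat\beta(\tau)}{\sqrt{\operatorname{Var}(\hat\beta(\tau))}} = \frac{\sum_{i=1}^{n} w_i(\tau)\,y_i}{\sigma\sqrt{\sum_{i=1}^{n} w_i(\tau)^2}}.$$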
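The weight representation of $Z(\tau)$ lends itself to a direct scan over candidate points. The sketch below is illustrative only: it assumes a jump ($k = 1$), a row-normalized Epanechnikov kernel smoother, a known $\sigma$, and synthetic data; the function names are hypothetical.

```python
import numpy as np


def smoother_matrix(t, h):
    """Row-normalized Epanechnikov kernel smoother matrix (an assumed choice of S)."""
    u = (t[:, None] - t[None, :]) / h
    K = np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)
    return K / K.sum(axis=1, keepdims=True)


def z_process(y, t, S, candidates, sigma):
    """Z(tau) = sum_i w_i(tau) y_i / (sigma * sqrt(sum_i w_i(tau)^2)) for a jump (k = 1)."""
    R = np.eye(len(y)) - S
    A = R.T @ R                                  # (I - S)'(I - S)
    z = np.empty(len(candidates))
    for j, tau in enumerate(candidates):
        phi_tau = (t > tau).astype(float)        # phi_1(t_i - tau)
        w = A @ phi_tau                          # weights w_i(tau)
        z[j] = (w @ y) / (sigma * np.sqrt(w @ w))
    return z


# Synthetic data (an assumption): jump of size 2 at 0.5025, noise standard deviation 0.1.
rng = np.random.default_rng(0)
n, h, sigma = 200, 0.1, 0.1
t = np.arange(1, n + 1) / n
y = np.sin(2 * np.pi * t) + 2.0 * (t > 0.5025) + rng.normal(0.0, sigma, n)

# Candidate points: midpoints between design points, restricted to [h, 1 - h].
mid = (t[:-1] + t[1:]) / 2
candidates = mid[(mid >= h) & (mid <= 1 - h)]

z = z_process(y, t, smoother_matrix(t, h), candidates, sigma)
tau_hat = candidates[np.argmax(np.abs(z))]
print("largest |Z|:", np.abs(z).max(), "at tau =", tau_hat)
```

In practice $\sigma$ would be replaced by an estimate, and the largest $|Z(\tau)|$ compared with a critical value.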


Speckman (1993) showed that $Z(\tau)$ can be well approximated by a convolution process with weights $w_i(\tau)$ of the form $w(t_i - \tau)$. Figure 2(a) displays actual weights for detecting a jump in the function, i.e. $k = 1$, with $n = 100$, $h = .1$ and $\tau = .5$. Several aspects of the equivalent filter function are apparent. In Speckman (1993), it is shown that $w(t)$ has support $[-2h, 2h]$ and that $w(t) = g_+(t) - g_-(t)$, where $g_-(t) = g_+(-t)$, and $g_+$ is a one-sided kernel satisfying $\int g_+(t)\,dt = 1$, $\int t\,g_+(t)\,dt = 0$ with support $[0, 2h]$. Thus the semiparametric estimate of $\beta$ can be viewed as the difference of two one-sided kernel estimates. This compares with the explicit estimates constructed with differences of kernel estimates in Müller (1992) and Wu and Chu (1993), for example. Figure 2(b) shows weights for an example of cusp detection, $k = 2$, with $n = 100$, $h = .1$ and $\tau = .5$. Once again, $w(t)$ can be seen to be the difference of two one-sided kernels $w(t) = g_+(t) - g_-(t)$, where now $g_-(t) = -g_+(-t)$, $\int g_+(t)\,dt = 0$ and $\int t\,g_+(t)\,dt = 1$.

Figure 2: Actual filter weights $w_i(\tau)$ with $n = 100$, $h = .1$, (a) $k = 1$ and (b) $k = 2$.

The following theoretical results concerning the asymptotic bias of $Z(\tau)$ are obtained in Speckman (1993).

Theorem 1 If $\beta = \mu^{(k-1)}(\tau+) - \mu^{(k-1)}(\tau-) \ne 0$,

[...]

as $n \to \infty$ for some constant $C_1$ depending only on $K$, $k$ and $m$.

If $\beta = 0$ and $f \in C^{2m-k}[0, 1]$, then

[...]

for some constant $C_2$ depending only on $K$, $k$ and $m$.

This result gives some guideline to choice of bandwidth for the detection problem. In order to avoid false detection of change-points, one needs relatively low bias. As an example, suppose a second order smoother ($m = 2$) is used. In searching for a jump in the function, $k = 1$, and it is easy to see that the bandwidth must satisfy $h = o(n^{-1/7})$ to have asymptotically negligible bias. This is a relatively large bandwidth, especially in comparison with the usual “optimal” $h \sim n^{-1/5}$ typically recommended for smoothing in this context. Note that the constant in this case depends on $f^{(2m-k)}(\tau)$, so it is possible for finite samples that a region of high curvature could be mistaken for a jump by this method.

If one is searching for a potential change in the first derivative (a cusp), $k = 2$ and the bandwidth must satisfy $h = o(n^{-1/5})$. This suggests that undersmoothing is necessary relative to the usual widths chosen for smoothing. From a practical standpoint, if too large a bandwidth is used, a point of sharp curvature may show up as a cusp.

3.2. Choice of critical value

For fixed $\alpha$, the problem is to determine a constant $c_\alpha$ such that

$$P\Bigl\{\max_{\tau} |Z(\tau)| > c_\alpha\Bigr\} \le \alpha$$

provided $\mu^{(k-1)}$ is continuous. For finite samples, this problem is not well posed because $f^{(2m-k)}(\tau)$ could be arbitrarily large. However, asymptotic results are possible. Assume the usual sequence of problems $y_{in} = \mu(t_{in}) + \varepsilon_{in}$, $i = 1, \ldots, n$, $n = 1, 2, \ldots$. If $\mu$ is fixed and sufficiently smooth and $h_n \to 0$ at a suitable rate, it is possible to find a sequence $c_{\alpha n}$ such that

$$P\Bigl\{\max_{h_n \le \tau \le 1 - h_n} |Z_n(\tau)| > c_{\alpha n}\Bigr\} \to \alpha.$$

In Speckman (1993), the asymptotic distribution is given explicitly, and simulation studies are reported. Unfortunately, the asymptotic distribution is not very accurate for finite samples, even when bias is negligible, especially for $k = 1$. For $k \ge 2$, the tube formula (cf. Johansen and Johnstone, 1990, or Sun and Loader, 1993) can be used for improved estimation of $c_\alpha$, and the author has also had some success in applying the Poisson clumping heuristic (see Aldous, 1989). However, for $k = 1$, the author has found that often the best approximation to the critical value can be obtained by a simple application of the Bonferroni inequality, $c_\alpha = z_{\alpha/2n}$, where $z_\alpha$ is the $1 - \alpha$ percentile of the standard normal distribution.
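The Bonferroni critical value is simple to compute. The short sketch below uses only the Python standard library; the function name is hypothetical, and the call uses the values quoted for the penny data in this section ($\alpha = .05$, 35 candidate years searched).

```python
from statistics import NormalDist


def bonferroni_critical_value(alpha, n_candidates):
    """c_alpha = z_{alpha / (2 n)}: the upper alpha/(2n) point of the standard normal."""
    return NormalDist().inv_cdf(1.0 - alpha / (2.0 * n_candidates))


# Two-sided test at alpha = 0.05 over 35 searched candidate points.
c_penny = bonferroni_critical_value(0.05, 35)
print(round(c_penny, 3))
```

The factor 2 in the denominator accounts for the two-sided comparison $|Z(\tau)| > c_\alpha$.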

Figure 3 shows the plot of $Z(\tau)$ for the penny data with $h = 4$. Here the data occurred in pairs, so the natural independent estimate $\hat\sigma = .820$ on 45 degrees of freedom was used. There are 45 years in the data set, and the middle 35 were searched, so the Bonferroni bound $z_{.05/70} = 3.189$ is shown. The change-points in 1958 and 1974 are clearly visible. (Note that for $k = 1$, $Z(\tau)$ is actually a piecewise constant function in $\tau$.)

Figure 3: Plot of $Z(\tau)$ for $k = 1$, $h = 4$. Dashed line shows approximate critical value by Bonferroni for $\alpha = .05$.


3.3.  Estimating  change-points 

Having located a change-point by the above detection scheme, the next step is to estimate the exact location. Again, this can be accomplished in principle with the semiparametric model for quite general smoothers using nonlinear least squares. For fixed $k$, recall the definition (3.1) of the weight vector $\phi_{k\tau}$. The (perhaps local) model

$$\mu(t) = \beta_0\phi_k(t - \tau_0) + f(t)$$

can be estimated by minimizing

$$\|(I - S)(y - \beta\phi_{k\tau})\|^2$$

simultaneously in $\beta$ and $\tau$. Noting that

$$\min_\beta \|(I - S)(y - \beta\phi_{k\tau})\|^2 = \|(I - S)y\|^2 - \hat\beta(\tau)^2\,\phi_{k\tau}'(I - S)'(I - S)\phi_{k\tau},$$

it can be shown that $\phi_{k\tau}'(I - S)'(I - S)\phi_{k\tau}$ is essentially independent of $\tau$, so the nonlinear least squares estimate of $\tau$ is asymptotically equivalent to the estimator which maximizes $|\hat\beta(\tau)|$. Thus the asymptotic distribution of $\hat\tau$ is obtained by studying the $Z(\tau)$ process in a neighborhood of $\tau_0$. The asymptotics are different because $\beta$ is assumed nonzero, and the situation is analogous to the results obtained by Müller (1992). Related results are obtained by Wu and Chu (1993).

The case $k = 2$ is worked out in detail in Speckman and Eubank (1994). (Similar results hold for $k > 2$.) Assume

$$\mu(t) = \beta_0\phi_2(t - \tau_0) + f(t).$$

Under the assumptions of the previous section for $m = 2$, the following results are obtained.

Theorem 2 If $\beta_0 \ne 0$ and $f''$ is bounded,


2/?o 


where $A_1 = f''(\tau_0+) - f''(\tau_0-)$, $L = \lim \sqrt{nh^5}$, and $C_1$ and $C_2$ are constants depending only on $K$.

Here $L$ is defined to be zero if $h = o(n^{-1/5})$ or a nonzero constant if $h$ converges to zero at exactly the rate $n^{-1/5}$. Note that $\hat\tau$ is asymptotically unbiased only if $h = o(n^{-1/5})$, i.e. if slight undersmoothing is used relative to the usual rate for best estimation of $\mu$. Note also that if $h \sim n^{-1/5}$, $\hat\tau - \tau_0 = O_p(n^{-2/5})$, the best possible nonparametric rate for estimating $\mu(t)$ with two derivatives.

Asymptotic results also are available for estimating $\beta_0$.

Theorem 3 Under the conditions of Theorem 2,

$$\sqrt{nh^3}\,(\hat\beta(\hat\tau) - \beta_0) \xrightarrow{d} N(C_4 A_2 L, C_5),$$

where $A_2 = \tfrac{1}{2}(f''(\tau_0+) + f''(\tau_0-))$ and $C_4$ and $C_5$ are constants depending only on $K$.

Thus if $h \sim n^{-1/5}$, $\hat\beta - \beta_0 = O_p(n^{-1/5})$, the optimal rate for estimating $\mu'$ with two continuous derivatives. Note that it is possible to estimate $\tau_0$ better than $\beta_0$.

3.4.  Data-based  bandwidth  choice 

If the primary goal is to fit a function with discontinuities, a global estimate of $\mu$ is given by

$$\hat\mu = Sy + P(I - S)y,$$

where $P = \tilde X(\tilde X'\tilde X)^{-1}\tilde X'$ with $\tilde X = (I - S)X$. Thus the influence matrix is $S + P(I - S)$, and it is natural to modify generalized cross-validation to estimate a global optimal bandwidth choice. To that end, define

$$GCV(h) = \frac{\|y - \hat\mu\|^2 / n}{\bigl(1 - \operatorname{tr}(S + P(I - S))/n\bigr)^2}.$$

Another variant which might be less sensitive to undersmoothing follows a suggestion of Rice (1984):

$$T(h) = \frac{\|y - \hat\mu\|^2 / n}{1 - 2\operatorname{tr}(S + P(I - S))/n}.$$

Note that $\operatorname{tr}(S + P(I - S)) = \operatorname{tr}(S) + p - \operatorname{tr}(PS)$ since the projection matrix $P$ has rank $p$ by assumption. If $S$ is symmetric with all eigenvalues between 0 and 1, it is not hard to show that $0 \le \operatorname{tr}(PS) \le p$. Since $\operatorname{tr}(S) \to \infty$ as $n \to \infty$, the terms involving $P$ are negligible for large $n$. It is feasible to compute $\operatorname{tr}(PS)$ directly by noting that

$$\operatorname{tr}(PS) = \operatorname{tr}(\tilde X(\tilde X'\tilde X)^{-1}\tilde X'S) = \operatorname{tr}((\tilde X'\tilde X)^{-1}\tilde X'S\tilde X).$$

Since $S\tilde X$ can be obtained by smoothing the columns of $\tilde X$, the diagonal elements needed for the trace can be computed directly by matrix operations or by regressing
the columns of $S\tilde X$ on $\tilde X$.

Figure 4: (a) Motorcycle data with kernel smooth, bandwidth 4; (b) $\sigma Z(\tau)$ versus time, $h = 7$; (c) Fit with change-points at 23.2 and 32.0; (d) Fit with initial constant to 13.2, change-points at 23.2 and 32.0, $h = 5.2$.
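The trace identity of Section 3.4 and the modified generalized cross-validation criterion can be checked numerically. The following sketch makes several assumptions that are not in the paper: a row-normalized Epanechnikov kernel smoother, synthetic data with one known jump, and a small hypothetical grid of bandwidths. It computes $\operatorname{tr}(S + P(I - S))$ both directly and through the $p \times p$ shortcut.

```python
import numpy as np


def smoother_matrix(t, h):
    """Row-normalized Epanechnikov kernel smoother matrix (an assumed choice of S)."""
    u = (t[:, None] - t[None, :]) / h
    K = np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)
    return K / K.sum(axis=1, keepdims=True)


def gcv_with_traces(y, S, X):
    """Return (GCV, trace computed directly, trace via tr(S) + p - tr(PS))."""
    n, p = X.shape
    R = np.eye(n) - S
    X_tilde = R @ X
    G = np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T)   # (X~'X~)^{-1} X~'
    P = X_tilde @ G                                        # projection onto columns of X~
    mu_hat = S @ y + P @ (R @ y)
    tr_direct = np.trace(S + P @ R)
    # tr(PS) = tr((X~'X~)^{-1} X~' S X~): only a p x p matrix is needed.
    tr_short = np.trace(S) + p - np.trace(G @ (S @ X_tilde))
    gcv = (np.sum((y - mu_hat) ** 2) / n) / (1.0 - tr_short / n) ** 2
    return gcv, tr_direct, tr_short


# Synthetic data (an assumption): jump of size 2 at tau = 0.5025.
rng = np.random.default_rng(1)
n, tau = 200, 0.5025
t = np.arange(1, n + 1) / n
X = (t > tau).astype(float)[:, None]
y = np.sin(2 * np.pi * t) + 2.0 * X[:, 0] + rng.normal(0.0, 0.1, n)

# Hypothetical bandwidth grid; pick the bandwidth with the smallest GCV score.
results = {h: gcv_with_traces(y, smoother_matrix(t, h), X) for h in (0.03, 0.05, 0.1, 0.2)}
h_best = min(results, key=lambda h: results[h][0])
print("GCV-selected bandwidth:", h_best)
```

The shortcut avoids forming the $n \times n$ projection $P$ when only the trace is needed.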


4.  Application  to  motorcycle  data 

To illustrate the use of change-point models for first derivatives, consider the “motorcycle data” in Silverman (1985). The data consist of 132 observations made on cadavers in simulated motorcycle collisions. The explanatory variable is time (in milliseconds) after impact, and the dependent variable is the head acceleration (in g) of a post mortem human test object. Figure 4(a) shows the data with a Gasser-Müller kernel smooth ($h = 4$ chosen by cross-validation).

The plot and the smooth show sudden changes in the direction of acceleration somewhere around 13, 23 and 32 milliseconds. In private communication, S. Portnoy has suggested that these features might be modeled as change-points in the first derivative. Such a model does not necessarily imply that there are actual corresponding physical change-points, but a model with cusps might provide a better fit to the data than smoothing and also have a useful interpretation.

Figure 4(b) shows a plot of $\sigma Z(\tau)$ for these data. Unfortunately the data are too noisy to apply the detection criteria above, so the plot is not calibrated. For visual clarity, a bandwidth of $h = 7$ was used, and $\sigma$ was not estimated. (With smaller bandwidths, the last peak is not as apparent.) There are three obvious local maxima, approximately at 13.2, 23.2 and 32.0 ms. Of course, it is very difficult to determine from the data alone if these points are “real” or if they are the result of large values of $\mu''(t)$.

The semiparametric fit with change-points 23.2 and 32.0 is shown in Figure 4(c). Unfortunately, the severe imbalance in variance between the initial data (when the head is at rest) and subsequent data prevented a good fit at the first change-point. This can be handled with weighted least squares, but an alternative strategy is presented below.


5.  Constrained  estimation 


In the motorcycle data, it is reasonable to model position as initially constant until impact. This motivates fitting a model of the form

$$\mu(t) = \begin{cases} c, & t \le \tau, \\ f(t), & t > \tau, \end{cases}$$

where $f(\tau) = c$ and $f(t)$, $t > \tau$, is smooth but otherwise unspecified. (As in the treatment in the last section, this procedure can be modified to fit a function with additional cusps as well.) This problem can be addressed in several ways with semiparametric models. One method is as follows.

Since $\tau$ is a change-point of order 2, it can be located with the methodology of Section 3.3. Letting $\hat\tau$ denote the result, the natural estimate of $c$ is


$$\hat c = \frac{1}{\#\{i : t_i \le \hat\tau\}} \sum_{t_i \le \hat\tau} y_i.$$


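Thus the problem is to construct a semiparametric curve estimate, say $\tilde f(t)$, with a cusp at $\hat\tau$ satisfying $\tilde f(\hat\tau) = \hat c$. Then

$$\hat\mu(t) = \begin{cases} \hat c, & t \le \hat\tau, \\ \tilde f(t), & t > \hat\tau, \end{cases}$$

will be an estimate with the desired properties. The required $\tilde f$ can be obtained by weighted least squares subject to a constraint. Consider the model $\mu(t) = \beta\phi_2(t - \tau) + f(t)$ as in Section 2, and let $L$ be the $n \times 1$ vector such that $L'f = f(\hat\tau)$. Then the problem is to solve

$$\min_\beta \|(I - S)(y - X\beta)\|^2$$

subject to

$$L'(Sy + \tilde X\beta) = \hat c.$$

The solution is easily seen to be

$$\tilde f = \hat f + PL(L'PL)^{-1}(\hat c - L'\hat f), \qquad (5.1)$$

where $\hat f = Sy + \tilde X\hat\beta$ (the unconstrained semiparametric fit) and $P = \tilde X(\tilde X'\tilde X)^{-1}\tilde X'$ as before.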
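The constrained fit can be sketched as follows. This is a minimal illustration, not the author's code: it assumes a cusp ($k = 2$) at a change-point treated as known, takes $L$ to be the unit vector picking out the design point nearest the change-point, uses a row-normalized Epanechnikov kernel smoother, and runs on synthetic data. All function and variable names are hypothetical.

```python
import numpy as np


def smoother_matrix(t, h):
    """Row-normalized Epanechnikov kernel smoother matrix (an assumed choice of S)."""
    u = (t[:, None] - t[None, :]) / h
    K = np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)
    return K / K.sum(axis=1, keepdims=True)


def constrained_fit(y, S, X, L, c):
    """Unconstrained fit f_hat and constrained fit f_tilde with L'f_tilde = c."""
    n = len(y)
    R = np.eye(n) - S
    X_tilde = R @ X
    beta_hat, *_ = np.linalg.lstsq(X_tilde, R @ y, rcond=None)
    f_hat = S @ y + X_tilde @ beta_hat                     # unconstrained semiparametric fit
    P = X_tilde @ np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T)
    PL = P @ L
    # f_tilde = f_hat + P L (L'P L)^{-1} (c - L'f_hat); L'PL is a scalar here.
    f_tilde = f_hat + PL * (c - L @ f_hat) / (L @ PL)
    return f_hat, f_tilde


# Synthetic data (an assumption): constant level 1 up to tau = 0.3, then a linear rise.
rng = np.random.default_rng(2)
n, h, tau_hat = 100, 0.1, 0.3
t = np.arange(1, n + 1) / n
cusp = np.maximum(t - tau_hat, 0.0)                        # phi_2(t - tau)
y = 1.0 + 4.0 * cusp + rng.normal(0.0, 0.1, n)

c_hat = y[t <= tau_hat].mean()                             # average of the initial segment
idx = int(np.argmin(np.abs(t - tau_hat)))                  # design point nearest tau_hat
L = np.zeros(n)
L[idx] = 1.0

f_hat, f_tilde = constrained_fit(y, smoother_matrix(t, h), cusp[:, None], L, c_hat)
mu_hat = np.where(t <= tau_hat, c_hat, f_tilde)            # piecewise estimate
print("f_tilde at change-point:", f_tilde[idx], "target:", c_hat)
```

The correction only moves the fit along the column space of $\tilde X$, so the smooth part $Sy$ is left unchanged.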

A mixed semiparametric model was applied to the motorcycle data. All three potential change-points were included, and the fit was subject to the constraint (5.1). Generalized cross-validation of the combined model was used to obtain a new bandwidth $h = 5.2$, and the results are displayed in Figure 4(d).


6.  Summary  and  conclusions 

Change-point problems have a long history and large literature in statistics. Ideas from change-point modeling are also very closely related to topics such as edge detection and fitting functions with features such as jumps and peaks. The semiparametric modeling discussed here provides a general and flexible way to fit models with such features using a variety of linear smoothers. These models also provide simple ways to fit functions with
properties such as local constancy.

Abstract A nonparametric algorithm for restoring digital images corrupted with additive noise is presented. It is assumed the noisy image is a realization of a spatial autoregressive process which also has a regression component. On the basis of the nonparametric functional estimation theory, a nonparametric estimate of the image is given as a restoration result. The edge preserving issue is under consideration in this study. With the proper selection of the algorithm’s parameters, the estimate can preserve step edges while suppressing noise. The proposed algorithm is not sensitive to the estimation accuracy of the parameters, and can be run almost as quickly as a local averaging filter.


1.  Introduction 

Image restoration is a process to recover the original image from degradations due to blurring and noise corruption. In this paper, the degradation sources are limited to additive noise. The most common model in this setting is the following:

$$y_{ij} = f(i, j) + \varepsilon_{ij} \qquad (1)$$

where $i, j = 0, 1, \ldots, m-1$, $\{y_{ij}\}$ is the degraded observation, $f_{ij} = f(i, j)$ is the original signal, and $\{\varepsilon_{ij}\}$ is a zero mean noise that may contain outliers. The assumption on $\varepsilon_{ij}$ implies that any restoration procedure must be resistant to outliers, i.e., the restoration must be robust. From a statistical point of view, image restoration under this model can be given by a robust regression. This approach, however, can have three problems. Firstly, an ordinary parametric regression procedure will make various assumptions on the signal function and the distribution of the noise, which limits its practical usefulness. Secondly, model (1) does not capture the nonstationarity of the image random field, and so current smoothing techniques have various limitations in performance. And finally, these smoothers lack efficient means of accommodating the need for edge preserving. Recently, nonparametric functional estimation theory [1] has provided some versatile regression tools that can be utilized to recover noise-degraded images.

It is well known that image restoration is an ill-posed inverse problem, i.e., no unique solution exists. Hence, previous studies employ either minimum mean square error (MMSE) criteria or Bayesian analysis to estimate $\{f_{ij}\}$ given observations $\{y_{ij}\}$. In the MMSE approach, images are initially assumed to be a stationary random field, and later to follow a nonstationary mean, nonstationary variance (NMNV) image stochastic model [2]. The Bayesian approach to restoration is based on the a priori knowledge of the statistical properties of the ensemble of objects $\{f\}$. This usually takes the form of a Markov random field [3,6]. The use of local properties is the characteristic of both NMNV and Markovian models. In this paper, local dependence is explicitly modeled as a spatial autoregressive process compounded with a regression component, which we call the AutoRegressive-REgression (ARRE) model. Parametric linear ARRE models have been studied by Ripley [4] and Cliff and Ord [5]. This paper extends the linear ARRE to a general ARRE model which is adaptive to the image. The smoothing algorithm based on this model can smooth out both additive noise and outliers (impulse noise) while preserving sharp edges and corners and therefore keeping most details clear. It is very time efficient as well. If no impulse noise is present, restoration can be done in one loop on an image matrix. In addition, each pixel can be processed separately without waiting for the results of its neighboring pixels. This makes it suitable for parallel processing.

2. ARRE Image Model and Its Nonparametric Estimator

Let $N_{uv} = \{y_{ij} \mid \text{grid point } (i,j) \text{ is within a local neighborhood of } (u,v)\}$, where $(u,v)$ is not necessarily a grid point. The ARRE image model assumes

$$y_{ij} = g(N_{ij}, i, j) + \varepsilon_{ij} \qquad (2)$$

where $i, j = 0, 1, \ldots, m-1$, and $f_{ij} = g(N_{ij}, i, j)$ is the unknown image function that needs to be estimated by a restoration procedure. In contrast with this model, we would like to call Eq. (1) the REgression (RE) model. We assume that this image field $\{y_{uv}\}$ satisfies the $\varphi$-mixing condition [7]. We also assume that the random field $\{y_{uv}\}$ is homogeneous in $\varphi$-mixing, i.e., the coefficient $\varphi$ does not depend on position $(u,v)$. To give an ARRE restoration, we only need to know r.


where $N_{uv}(k)$ refers to the $k$-th element of $N_{uv}$, and $w_1$ and $w_0$ are weights satisfying $w_0 > w_1 > 0$ and $w_0 + 4w_1 = 1$. These weights play important roles in regulating the smoother’s outlier-resisting and detail-preserving abilities. Our experimental results show this four-neighborhood system works well for various types of images. We define $\mathrm{dist}(\cdot,\cdot)$ as the square of a Euclidean type of distance function just for the convenience of analyzing the mean distance $E\,\mathrm{dist}(N_{uv}, N_{ij})$. We denote the restored image by $\{\hat f_{ij}\}_{m \times m}$, where $m \times m$ is the image size. We give a nonparametric estimate of the ARRE model (2) in the following equation (3).


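$$\hat f_{uv} = \frac{\sum_i \sum_j y_{ij}\,k_{h_1}\bigl(\mathrm{dist}(N_{uv}, N_{ij})\bigr)\,k_{h_2}(u - i)\,k_{h_2}(v - j)}{\sum_i \sum_j k_{h_1}\bigl(\mathrm{dist}(N_{uv}, N_{ij})\bigr)\,k_{h_2}(u - i)\,k_{h_2}(v - j)}, \quad u, v = 0, 1, \ldots, m-1, \qquad (3)$$

with the sums taken over $(i,j) \ne (u,v)$,

which determines the size of the neighborhoods $N_{uv}$.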

The nonparametric estimator for the RE model (1) and its properties have been well studied by researchers [8,9,10]. The $\varepsilon_{ij}$’s can be both dependent and non-identically distributed random variables satisfying the $\varphi$-mixing condition. One recent study which is analogous to ours in the time domain is the nonparametric prediction for an autoregressive time series by Collomb [7], who studies autoregressive time series of the form $y_i = r(y_{i-1}, \ldots) + \varepsilon_i$.

To give a nonparametric estimator for the ARRE model (2), we assume the observations $\{y_{ij}\}$ come from a $\varphi$-mixing random field and let the $N_{ij}$’s contain equal numbers of pixels. The neighborhood structure $N_{uv}$ is determined by the $\varphi$-mixing condition of the given image. In this study, for simplicity, we empirically use a four-neighbor system for each pixel. $N_{uv}$ is then the following $1 \times 5$ vector:

w  =  {  ^Yi-i  J-i  •  Yi-ij  •  Yc  •  J'ij-i  '  Yiji 

•  Vij-i  >  Yij  •  '  yi*xji 

where the center pixel $y_c = (y_{i-1,j} + y_{i+1,j} + y_{i,j-1} + y_{i,j+1})/4$ is the
initial value for the interpolation. For the purpose of
robustness against outliers, we let $y_c$ be the center of $N_{ij}$ and
leave this center pixel out of the summations in (3). We
then measure the distance between $N_{uv}$ and $N_{ij}$ by the following
weighted sum of squares:


where $k_{h_i}(\cdot) = k_i(\cdot/h_i)$, $i = 1, 2$, are two kernel functions [1].

This estimate can be justified in two ways. Firstly,
note that the random sequence in [7] is implicitly assumed
to be generally nonstationary. When we explicitly include
a deterministic spatial variable $(i,j)$ to capture the
nonstationarity and assume image signals are $\varphi$-mixing, the
proof of the asymptotic properties [7] is still valid
as long as the joint density function of $y_{ij}$ and $N_{ij}$ is
continuous in the spatial position variable $(i,j)$.

Secondly, if we let $k_{h_1}(\cdot)$ be the rectangular kernel,
(3) reduces to the RE estimator when the window size $h_1$ is
large enough and $(u,v)$ lies within a smooth area
of the image. The ARRE model is then reduced to the RE
model. It is worth noting again that, in [10], random noises
do not necessarily have to be identically distributed nor
independent of each other. Where the ARRE model and its
estimate differ from the RE model is in the results in edge areas

if  (i-l<u<i) Aij-l<v<j) 
if  (i=u)A(j‘*v) 

and in the areas with discontinuities. Our restoration
algorithm based on the ARRE model preserves edges and
details while the RE model does not. In the finite sample situation,
we have found through simulations that the triangular kernel
based estimate performs better than the rectangular one in
suppressing certain types of impulse noise but worse in


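$$\mathrm{dist}(N_{uv}, N_{ij}) = w_1 (N_{uv}(1) - N_{ij}(1))^2 + w_1 (N_{uv}(2) - N_{ij}(2))^2 + w_0 (N_{uv}(3) - N_{ij}(3))^2 + w_1 (N_{uv}(4) - N_{ij}(4))^2 + w_1 (N_{uv}(5) - N_{ij}(5))^2$$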
smoothing regular noise. In our image restoration
experiments in section 3, we always use the
rectangular kernel.

There are four parameters $h_1$, $h_2$, $w_0$, and $w_1$. We
believe that regular MSE-based cross-validation [7] with a
smoothness constraint is no longer sufficient for edge
preserving and that the MSE of the second order derivatives
should be included in the objective function of the
optimization. While this is the direction to go, the possible
performance improvement from this type of optimization
would be restrained by the complexity of the restoration
filter. Our approach in this paper is to consider the effects
of these parameters separately in the following way.

We determine $h_1$ by studying the mean distance
$E\,\mathrm{dist}(N_{uv}, N_{ij})$. For ease of analysis, we assume the noise
field $\{e_{ij}\}$ is a zero-mean independent sequence with a
standard deviation (std) $\sigma_n$. Outliers are zero-mean with a
std $\sigma_o$, which is significantly greater than $\sigma_n$.

It is easy to calculate that, in a smooth area with
no outliers, $E\,\mathrm{dist}(N_{uv}, N_{ij}) = 2\sigma_n^2$, and in a smooth area with
an outlier, $E\,\mathrm{dist}(N_{uv}, N_{ij}) = w_0(\sigma_o^2 - \sigma_n^2) + 2\sigma_n^2$. When $f_{uv}$ and
$f_{ij}$ are located on opposite sides of an edge that has a
grey level contrast $d$, $\min E\,\mathrm{dist}(N_{uv}, N_{ij}) = w_0 d^2 + 2\sigma_n^2$.
When $f_{uv}$ and $f_{ij}$ are located on the same side of the edge,
$\max E\,\mathrm{dist}(N_{uv}, N_{ij}) = 4 w_1 d^2 + 2\sigma_n^2$. Therefore, for the
objective of robust restoration, we should choose $h_1$ such that

$$2\sigma_n^2 < h_1 < w_0(\sigma_o^2 - \sigma_n^2) + 2\sigma_n^2. \quad (4)$$

For the objective of edge-preserving restoration, we should
choose $h_1$ such that

$$4 w_1 d^2 + 2\sigma_n^2 < h_1 < w_0 d^2 + 2\sigma_n^2, \quad (5)$$

where $d$ is the minimum grey level contrast of all the edges
that need to be preserved. An additional condition on the
weights follows from (5): $4w_1 < w_0$. Obviously, to select $h_1$
to satisfy both (4) and (5), we must have
$\sigma_o^2 > d^2 \cdot 4w_1/w_0 + \sigma_n^2$.
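The two conditions on $h_1$ can be checked numerically. The following is a minimal sketch, not the authors' code: the function name `h1_interval` and the symbol names `sigma_n` (noise std) and `sigma_o` (outlier std) are labels chosen here for illustration. It intersects the robustness range (4) with the edge-preserving range (5) and reports whether a common value of $h_1$ exists.

```python
def h1_interval(w0, w1, sigma_n, sigma_o, d):
    """Return the (low, high) range of h1 that satisfies both (4) and (5),
    or None when the two ranges do not overlap."""
    base = 2.0 * sigma_n ** 2
    # Condition (4): robustness against outliers.
    robust = (base, w0 * (sigma_o ** 2 - sigma_n ** 2) + base)
    # Condition (5): preservation of edges with contrast at least d.
    edge = (4.0 * w1 * d ** 2 + base, w0 * d ** 2 + base)
    low = max(robust[0], edge[0])
    high = min(robust[1], edge[1])
    return (low, high) if low < high else None


# Illustrative values only (not taken from the experiments).
print(h1_interval(0.6, 0.1, sigma_n=5.0, sigma_o=60.0, d=40.0))  # roughly (690, 1010)
print(h1_interval(0.6, 0.1, sigma_n=5.0, sigma_o=10.0, d=40.0))  # None: outliers too weak
```

The second call returns `None` because the outlier variance does not exceed $d^2 \cdot 4w_1/w_0 + \sigma_n^2$, which is the feasibility condition stated above.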

As for the value of $h_2$, since it is the parameter that
controls the balance between fidelity and smoothness, we
choose its value according to the nature of the data. In this
image processing application, we choose to use either 2.5 or
3.5 to let the weighted averaging (3) take place over 5x5 or
7x7 windows in the image plane.
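To make the roles of $h_1$ and $h_2$ concrete, here is a simplified sketch of a kernel-weighted restoration in the spirit of (3), using rectangular kernels for both $k_1$ and $k_2$. It is not the authors' algorithm (the complete version is in [13]). Several choices are assumptions made for illustration: the neighborhood of the pixel being estimated uses the four-neighbor mean $y_c$ in its center slot while candidate neighborhoods use the observed pixel, border pixels are replicated, and a pixel is left unchanged when the denominator is zero instead of applying a local statistical test.

```python
def _four_neighbours(img, i, j):
    m = len(img)
    clamp = lambda k: min(max(k, 0), m - 1)  # assumption: replicate border pixels
    return (img[clamp(i - 1)][j], img[i][clamp(j - 1)],
            img[i][clamp(j + 1)], img[clamp(i + 1)][j])


def target_nbhd(img, u, v):
    """Neighborhood of the pixel being estimated: centre slot holds y_c."""
    up, left, right, down = _four_neighbours(img, u, v)
    y_c = (up + left + right + down) / 4.0
    return (up, left, y_c, right, down)


def candidate_nbhd(img, i, j):
    """Neighborhood of a candidate pixel: centre slot holds the observed value."""
    up, left, right, down = _four_neighbours(img, i, j)
    return (up, left, img[i][j], right, down)


def dist(a, b, w0=0.6, w1=0.1):
    """Weighted sum of squares with weight w0 on the centre element."""
    weights = (w1, w1, w0, w1, w1)
    return sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b))


def arre_restore(img, h1, h2=2.5, w0=0.6, w1=0.1):
    """One pass over a square image (list of lists) with rectangular kernels."""
    m = len(img)
    r = int(h2)  # h2 = 2.5 gives a 5x5 window, 3.5 gives 7x7
    cand = [[candidate_nbhd(img, i, j) for j in range(m)] for i in range(m)]
    out = [row[:] for row in img]
    for u in range(m):
        for v in range(m):
            n_uv = target_nbhd(img, u, v)
            num = den = 0.0
            for i in range(max(0, u - r), min(m, u + r + 1)):
                for j in range(max(0, v - r), min(m, v + r + 1)):
                    if (i, j) == (u, v):
                        continue  # centre pixel is left out of the sums
                    if dist(n_uv, cand[i][j], w0, w1) <= h1:
                        num += img[i][j]
                        den += 1.0
            if den > 0.0:
                out[u][v] = num / den
            # else: keep the observed value (assumption, see lead-in)
    return out
```

On a flat image with a single impulse this sketch removes the impulse, and on a noise-free step edge with $h_1$ set to the midpoint of (5) it leaves the edge unchanged. Running the function a second time on its own output corresponds to the twicing used in the experiments.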

Note that the denominator in (3) will become zero
on edges or where outlying noise corruption occurs. To
distinguish outliers from edge elements, a local statistical test
is sufficient when it is plugged into (3). A complete
nonparametric ARRE image restoration algorithm can be
found in [13].


3.  Experimental  Results 

In all the experiments, we use $w_0 = 0.6$ and $w_1 = 0.1$
so that $w_0 + 4w_1 = 1$ and $4w_1 < w_0$. We let $h_1$ be
$(4w_1 d^2 + 2\sigma_n^2 + w_0 d^2 + 2\sigma_n^2)/2$ and $h_2$ be 2.5 (5x5 averaging
window) except for the tool image, where $h_2 = 3.5$ (7x7
averaging window). For each experimental object, we run
the ARRE restoration algorithm twice, called twicing. That
means that, after we get the first output from the algorithm, we
take this output as the input of the second run. The results
are summarized in Table 1 in terms of the Signal-to-Noise
Ratio (SNR) improvement. Actual photos of all images can
be found in [13].


4.  Conclusions 

In this paper, we present an edge-preserving
restoration algorithm based on the nonparametric estimation
of an autoregressive-regression model. We also demonstrate
its performance through experiments. It smooths both additive
noise and additive impulse noise while preserving details,
including sharp corners. Whereas the priority is detail
preservation, the balance between noise suppression and detail
preservation can be adjusted by the input parameters $d$ and $h_2$.
Because of the nonparametric nature of the method, no assumption is
required concerning the distribution and the independence of
the noise.

In this algorithm, we need one piece of a priori information,
$\sigma_n$, the standard deviation of the additive noise, to determine
the parameter $h_1$. A method that estimates $\sigma_n$ directly from
the degraded image can be found in [12]. We can also
simply estimate $\sigma_n$ in a flat area of the image. In [13], we
show that the ARRE restoration algorithm is not sensitive
to the estimation error of $\sigma_n$ and $d$.
                   degraded   ARRE          ARRE      recursive
                   image      restoration   twicing   median

simulated image    19.880     28.807        33.656    30.652
girl image         16.541     23.936        23.203    18.923
tool image         12.053     22.592        27.387    21.932
scene image        18.534     21.550        22.775    22.660

Table  1.  Signal-to-noise  ratio  improvement  with  the  ARRE  restorations. 
Abstract

Epi  Meta,  Version  1.2,  is  a  meta-analysis  software 
package  developed  for  the  Centers  for  Disease  Control 
and  Prevention  (CDC)  with  the  intent  of  distribution  to 
those  who  analyze  public  health  data.  The  software  was 
designed  to  1)  be  user-friendly  from  both  a  software  and  a 
statistical  point  of  view;  2)  provide  meta-analysis  options 
not  previously  available  to  this  group  of  users  in  a 
menu-driven  user-friendly  package;  3)  include  the  features 
necessary  to  produce  a  meta-analysis  for  particular  data 
structures  without  needing  additional  software  or 
knowledge  of  a  programming  language;  and  4)  interface 
with  CDC’s  Epi  Info  comprehensive  data  management  and 
data  display  system. 

Throughout  the  design  and  implementation  of  Epi 
Meta  a  balance  was  maintained  so  that  the  novice  user 
would  not  be  overloaded  with  too  many  decisions,  yet  the 
more  experienced  user  would  not  be  limited  to  a  "canned" 
meta-analysis. Built into the system are default choices
which provide the novice user with a standard meta-analysis
and enough summary information and graphs to
understand  program  output  and  analysis  results.  Program 
and  system  design  and  architecture  features  include  heavy 
emphasis  on  graphical  displays  to  evaluate  the  fitted 
models,  and  menus  to  make  it  easy  to  choose  and  iterate 
on  models  and  data.  Although  Epi  Meta  is  primarily 
focussed  on  the  meta-analysis  of  dose-response  studies  of 
relative  risks,  the  underlying  methods  are  much  more 
widely  applicable.  The  analysis  methods  fit  straight-line 
relations  within  each  study  relating  relative  risk  to 
exposure  dose  using  transformations  and  weighted  least 
squares.  Goodness-of-fit  of  the  dose-response  model  is 
assessed and an outlier analysis is performed by means of
graphical  and  tabular  diagnostic  displays.  The  comparison 
across  studies  takes  the  resulting  slopes  and  intercepts 
from  the  within  study  analysis  and  individually  and  jointly 
compares the results using fixed and random effects
inferences.  Epi  Meta  uses  menus  and  on-screen 
information  to  guide  the  user  through  the  analysis  of  the 
multiple  individual  studies  and  the  comparison  across  all 


the  studies.  Integrated  into  the  package  is  the  data 
management  facility  of  Epi  Info,  allowing  for  easy  data 
entry  and  editing  of  the  data  throughout  the  meta-analysis 
session.  The  software  was  developed  by  Battelle  under 
Contract  No.  200-87-0540  with  CDC.  This  paper 
discusses  the  design  decisions  involved  in  producing  a 
stand-alone  statistical  software  package  through 
presentation  of  the  decisions  made  in  developing  Epi 
Meta. 


Overview 

Epi  Meta  was  developed  by  Battelle  for  CDC  to 
provide  public  health  officials  with  a  tool  to  help  them 
draw  appropriate  inferences  when  combining  results 
across  multiple  epidemiological  studies.  The  software  is 
oriented  toward  epidemiological  studies  and  is  intended  to 
be  used  by  public  health  officials  with  an  excellent 
understanding  of  the  data  but  not  necessarily  advanced 
training  in  either  statistical  theory  or  computer 
programming.  The  statistical  methods  to  be  included  in 
the  software  were  determined  by  a  literature  review  of 
methodology  and  application  papers  dealing  with  meta 
analyses  of  epidemiologic  and  medical  studies  and  by 
CDC’s  experience  with  the  target  audience  of  public 
health  officials.  The  literature  review  revealed  one 
striking  dissimilarity  between  the  statistical  approaches 
suggested  in  the  methodological  papers  and  those  actually 
utilized  in  application  papers.  The  methodology 
discussions  nearly  unanimously  recommended  the  use  of 
random  effects  model  based  inference  procedures, 
whereas  almost  all  the  meta-analysis  applications  reviewed 
were  based  on  fixed  effects  models  and  inference 
procedures.  Therefore,  a  primary  goal  of  Epi  Meta  was 
to  provide  user-friendly  software  that  offered  easy 
implementation  of  the  random  effects  model  to  address  the 
disparity  between  the  methodological  recommendations  of 
the  statisticians  and  the  methodological  practice  carried 
out  by  the  medical  and  epidemiological  meta-analysts. 

Other  requirements  included: 


*  Development  of  this  software  was  carried  out  by  Battelle  under  Contract  No.  200-87-0540  with 
the  Centers  for  Disease  Control  Epidemiology  Program  Office. 
1. The package needed to be user-friendly from a
computing  standpoint,  i.e.,  it  would  not  require 
any  computer  programming  and  would  be  easy  to 
learn  to  use.  As  a  corollary,  the  program 
documentation  needed  to  be  equally  user-friendly. 

2.  The  package  needed  to  be  user-friendly  from  a 
statistical  standpoint,  providing: 

a.  a  default  analysis  but  also  alternative 
analyses,  options  and  diagnostic  displays 
and  statistics. 

b.  the  ability  to  quickly  and  easily  iterate  on 
the  analysis  —  deleting  studies,  choosing 
transformations,  changing  models,  making 
predictions,  etc. 

3.  The  package  needed  to  be  portable  and  widely 
available,  not  requiring  expensive  or  specialized 
software  or  hardware. 

The  software  and  statistical  design  decisions  made 
in  developing  Epi  Meta  are  illustrative  of  programming 
issues that are characteristic of the development of stand-alone
statistical software for customized applications.
These  distinctive  statistical  programming  requirements 
arise  from  the  fact  that  in  developing  user-friendly 
customized  statistical  software  for  individuals  who  are  not 
professional  statisticians,  there  is  not  necessarily  a  single 
series  of  steps  or  a  single  right  answer  for  any  given 
analysis  and  certainly  no  right  answer  for  all  analyses. 

The  added  layer  of  complexity  in  statistical  programming 
comes  with  the  difficult  decisions  concerning  such  issues 
as: 

■  how  much  of  a  canned  analysis  (a  black 
box)  do  you  provide 

■  how  many  options  do  you  make  available 

■  how  much  guidance  do  you  give 

■  how  many  warnings  and  diagnostic  tools 
are  required 

■  how  do  you  lead  the  unsophisticated  user  to 
conclusions  while  still  guarding  against 
inappropriate  inferences. 

Most  often,  statistical  analysis  is  dynamic  and 
iterative,  leading  to  the  additional  question  of  how  do  you 
develop  a  software  interface  and  output  so  that  it  is  easy 
to  iterate  and  the  user  has  the  information  necessary  to 
perform  these  iterations. 

The  software  design  decisions  that  address  these 
specific  statistical  programming  needs  involve  the  user- 
friendliness  of  the  system  (from  a  computing  standpoint): 


use  of  a  menu-driven  system,  on-line  help,  ease  of  data 
entry  and  data  editing,  etc.  The  statistical  design 
decisions  that  address  these  needs  can  be  viewed  in  a 
hierarchical manner. On the first or primary level are the
statistical  methodology  and  programming  decisions  that 
are  made  by  the  software  developers.  These  decisions  are 
embedded  in  the  product  and  transparent  to  the  user, 
noted  only  in  the  technical  software  documentation.  An 
example  of  this  level  of  decision  in  Epi  Meta  would  be 
the weighting algorithm (DerSimonian and Laird) used in
the  random  effects  analysis.  On  the  second  level  are 
decisions  made  by  the  developers  that  cannot  be  changed 
by  the  users  but  which  are  noted  in  the  user’s  program 
output  as  informative  messages.  An  example  of  this  in 
Epi  Meta  is  the  use  of  the  theoretical  weighted  residual 
mean  square  (WRMS)  of  1  if  there  is  not  significant 
heterogeneity  in  the  weighted  linear  regression  for 
determining  the  within  study  dose-response  slope.  On  the 
third  and  highest  level  are  decisions  that  are  so  specific  to 
an  individual  analysis  or  so  capable  of  changing  the 
results  of  the  analysis  that  they  are  incorporated  as  user 
options  in  the  software  package.  An  example  of  this  level 
of  decision  in  Epi  Meta  is  the  choice  of  whether  to  use  an 
intercept  or  no-intercept  model  in  determining  the  within- 
study  dose-response  slope.  Further  examples  of  these 
types  of  decisions  are  given  in  the  presentation  of  Epi 
Meta that follows.

Epi  Meta 

Data  Management  System 

As mentioned above, a key requirement of user-friendly
statistical software is the ability to quickly and
easily iterate on the analysis. This, in turn, requires ease
of data entry and editing. In the case of Epi Meta, it was
essential that users be able to easily add and remove
dose/exposure  levels  and  studies.  For  this  reason,  Epi 
Info,  a  CDC-distributed  data  management  and  analysis 
software  package  familiar  to  many  public  health  officials, 
was  chosen  as  a  data  management  "front-end"  to  Epi 
Meta.  Epi  Meta  transparently  calls  Epi  Info’s  data 
management  facilities  to  allow  the  user  to  create,  edit  and 
save  data  sets  within  an  Epi  Meta  session. 

Epi  Meta  allows  the  user  to  create  two  types  of 
data  files.  The  first  type  is  for  the  case  where  there  are 
multiple  dose  levels  per  study. 

Figure  1  illustrates  the  first  data  entry  screen  for  this  type 
of  file.  For  each  study  in  the  analysis,  the  user  can  enter 
up  to  eight  dose  levels.  Within  each  dose  level  the  user 
enters  the  units  of  the  dose  level,  the  relative  risk,  and  the 
DOSE RESPONSE META ANALYSIS

Please enter the following information for EACH study:

Study Name

Dose/Exposure Level Data (Maximum of 6 Levels):

                Relative   Std Error   95% Confidence Bounds**
Level   Units   Risk(RR)   (RR)        Lower   Upper


** Note: Substitute for the Standard Error of the Relative Risk.


PAGE DOWN to enter STUDY LEVEL CHARACTERISTICS
STUDYNAME: All entries allowed  You must enter data

<Ctrl-N>-New <Ctrl-F>-Find F5-Print F6-Delete F9-Choices F10-Done Rec- 1


Figure  1.  Data  Entry  in  Epi  Meta 


The  within-study  analysis  is  only  operative  in  the  case 
where  there  are  multiple  dose/exposure  levels  per  study. 

Once  the  data  file  for  analysis  has  been  specified, 
a  within-study  analysis  options  screen  is  presented  as 
illustrated  in  Figure  2. 

===== Within-study Analysis =====

Choose the following parameters:

Model: 0

0 - Fixed Intercept
1 - Estimated Intercept

Transformation:

Relative Risk: 1    Dose Level: 0
0 - None
1 - Natural Log

standard  error  or  upper  and  lower  95%  confidence  bounds 
for  the  relative  risk. 

The  types  of  statistical  decisions  mentioned  above 
are  already  operative  at  this  stage  of  Epi  Meta.  For 
instance,  if  the  user  enters  both  the  standard  error  and  the 
95%  confidence  interval,  the  program  will  use  the 
standard  error  as  the  measure  of  variability  for  that 
exposure  level,  representing  a  primary  level  decision 
transparent  to  the  user. 
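The precedence rule described above (use the standard error when both it and the confidence bounds are entered) can be sketched as follows. The conversion formulas are common textbook choices and are assumptions here, since the text does not state how Epi Meta converts its inputs: a delta-method approximation SE(ln RR) = SE(RR)/RR, and a log-scale 95% interval of half-width 1.96 standard errors. The function names are hypothetical.

```python
import math

Z_95 = 1.959964  # two-sided 95% normal quantile


def se_log_rr_from_ci(lower, upper, z=Z_95):
    """SE of ln(RR) implied by a confidence interval symmetric on the log scale."""
    return (math.log(upper) - math.log(lower)) / (2.0 * z)


def log_rr_std_error(rr, se_rr=None, ci=None):
    """Pick the measure of variability for one exposure level.

    The standard error takes precedence over the confidence bounds
    when both are supplied, mirroring the rule described in the text.
    """
    if se_rr is not None:
        return se_rr / rr  # delta-method approximation (assumption)
    if ci is not None:
        lower, upper = ci
        return se_log_rr_from_ci(lower, upper)
    raise ValueError("either se_rr or ci must be given")


# Both supplied: the standard error is used and the bounds are ignored.
print(log_rr_std_error(1.09, se_rr=0.1023, ci=(0.5, 2.0)))
```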

The  second  type  of  data  file  allowed  is  for  the  case 
where  there  is  a  single  dose/exposure  level  per  study.  In 
this  case,  for  each  study  the  user  enters  the  study  name, 
the  relative  risk,  and  the  standard  error  or  upper  and 
lower  95%  confidence  bounds  for  the  relative  risk. 

The  data  entered  into  either  of  these  study  level 
data  files  is  not  limited  to  dose/exposure  levels  and 
relative  risks.  An  option  allows  for  entry  of  user-defined 
response  variables  and  an  associated  measure  of 
variability.  However,  for  this  discussion,  a  series  of  dose 
levels  with  associated  relative  risks  within  each  study  will 
be  used  for  illustration  of  the  software. 

After  the  data  have  been  entered  into  Epi  Meta, 
the  analyst  has  the  opportunity  to  edit  the  data  prior  to 
analysis.  The  analysis  process  is  divided  into  two 
portions, a "within-study" analysis and an "among-study"
analysis.  Each  portion  is  discussed  in  turn  below. 

Within-Study  Analysis 

The  within-study  analysis  provides  the  user  the 
ability  to  calculate  the  summary  statistic(s)  for  each  study 
(a  slope,  or  a  slope  and  an  intercept)  that  will 
subsequently  be  used  in  the  among-study  meta-analysis. 


Figure  2.  Within-Study  Options  Screen 

This  screen  is  an  example  of  the  third  and  highest  level  of 
decisions  required  in  statistical  programming:  those 
decisions  that  are  presented  as  options  to  the  user.  Even 
here,  however,  there  are  difficult  primary  level  choices  to 
be  made,  for  only  a  subset  of  all  possible  options  will  be 
made  available  to  the  users.  The  options  presented  allow 
the  user  reasonable  flexibility  in  determining  the  type  of 
analysis  without  providing  all  possibilities.  In  Epi  Meta, 
the  user  has  a  choice  of  intercepts  and  data 
transformations,  but  is  limited  to  the  straight  line  model 
and  weighted  linear  regression  chosen  by  the  software 
developers.  The  user’s  data  transformation  options  are 
limited  to  a  natural  log  transformation.  As  explained 
earlier,  defaults  are  provided  at  all  levels  of  the  program 
as  guidance  for  less  experienced  users.  These  defaults 
represent  a  compromise  between  providing  a  "black  box" 
analysis  and  providing  a  wide  variety  of  options.  In  this 
case,  the  default  option  for  the  user  is  to  run  the  within- 
study  analysis  using  a  fixed  intercept  model,  a  log 
transformation  of  the  relative  risk  and  no  transformation 
of  the  dose  level.  Decisions  concerning  the  default 
choices  are  often  very  difficult.  For  example,  in  the  case 
of  Epi  Meta,  many  epidemiologists  prefer  a  fixed 
intercept  model  because  of  the  contention  that  the  relative 
risk at dose zero is known to be 1. On the other hand, a
statistician  may  prefer  the  use  of  a  variable  intercept 
model  to  allow  for  a  better  fit  within  the  range  of  the  data 
if  there  is  non-linearity  at  low  dose  levels.  Because  the 
target  audience  for  this  software  is  public  health  officials, 
the  fixed  intercept  model  was  chosen  as  the  default. 

Since  Epi  Meta  was  designed  primarily  as  a  tool 
for  public  health  officials,  many  of  the  design  decisions, 
such  as  the  use  of  defaults,  were  made  with  the  intention 
of  helping  those  target  users.  However,  other  options 
were  included  to  allow  users  to  modify  the  analysis.  A 
good example of this occurs in the case where the user
chooses  the  variable  intercept  model.  Here  the  user  is 
given  the  option  to  center  the  dose  levels,  reducing 
correlation  between  the  estimated  intercept  and  slope  for 
each  study.  The  screen  for  this  option  is  illustrated  in 
Figure  3. 


The  average  dose  level  is  provided  for  those  that  would 
like  to  center  the  doses  at  the  mean  level,  and  the  range 
of  dose  levels  is  provided  for  those  who  wish  to  specify 
some  other  level.  This  on-screen  information  is  an 
example  of  a  second  level  design  decision  that  is  user- 
friendly  both  from  a  computing  and  a  statistical 
standpoint.  Note  that  if  the  user  specifies  a  dose  level 
outside  the  range  of  the  reported  dose  levels,  the  user  is 
warned  and  asked  for  confirmation.  Here  the  user  is 
provided  guidance,  but  left  with  the  final  analysis  option. 
The  default  option  is  no  dose  centering,  allowing  users 
unfamiliar  with  the  reasons  for  choice  of  a  centering  value 
to  pass  through  this  stage. 

The  actual  calculation  of  the  within-study  slope 
and  intercept  involves  many  primary  and  secondary  level 
decisions,  including  the  appropriate  weighting  scheme  for 
the  weighted  linear  regression,  the  heterogeneity  test,  and 
an  algorithm  for  determining  the  appropriate  weighted 
residual  mean  square.  In  general,  options  for  primary 
and  secondary  level  decisions  are  not  offered  to  the  users 
for  one  of  two  reasons:  1)  the  chosen  methodology  is 
determined  to  be  most  appropriate;  or  2)  the  different 
options  that  could  be  made  available  would  have  only  a 
minor  effect  on  the  calculated  results. 
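The within-study calculation can be illustrated with a short weighted least squares sketch. This is an illustration under stated assumptions and not Epi Meta's source code: the response is ln(RR), the dose is untransformed, the weights are the inverse squared delta-method standard errors (RR/SE(RR))^2, and the reported standard errors assume a WRMS of 1. The function name `within_study_fit` is hypothetical.

```python
import math


def within_study_fit(doses, rrs, se_rrs, estimate_intercept=True):
    """Weighted straight-line fit of ln(RR) on dose for one study."""
    y = [math.log(r) for r in rrs]
    w = [(r / s) ** 2 for r, s in zip(rrs, se_rrs)]  # 1 / (SE(RR)/RR)^2
    sw = sum(w)
    swx = sum(wi * x for wi, x in zip(w, doses))
    swy = sum(wi * yi for wi, yi in zip(w, y))
    swxx = sum(wi * x * x for wi, x in zip(w, doses))
    swxy = sum(wi * x * yi for wi, x, yi in zip(w, doses, y))
    if estimate_intercept:
        sxx = swxx - swx ** 2 / sw
        slope = (swxy - swx * swy / sw) / sxx
        intercept = (swy - slope * swx) / sw
        se_slope = math.sqrt(1.0 / sxx)
        se_intercept = math.sqrt(swxx / (sw * sxx))
        df = len(doses) - 2
    else:
        # Fixed intercept: RR = 1 at dose zero, i.e. ln(RR) = 0.
        slope = swxy / swxx
        intercept = 0.0
        se_slope = math.sqrt(1.0 / swxx)
        se_intercept = 0.0
        df = len(doses) - 1
    rss = sum(wi * (yi - intercept - slope * x) ** 2
              for wi, x, yi in zip(w, doses, y))
    return {"intercept": intercept, "slope": slope,
            "se_intercept": se_intercept, "se_slope": se_slope,
            "wrms": rss / df, "df": df}


doses = [60.0, 90.0, 144.0, 204.0, 258.0, 403.0]
rrs = [0.89, 1.09, 1.28, 1.24, 1.47, 1.70]
se_rrs = [0.0507, 0.1023, 0.1879, 0.2193, 0.3519, 0.5501]
print(within_study_fit(doses, rrs, se_rrs))
```

The data in the usage lines are the six dose levels of the example study shown in Figure 4; under the assumptions above the sketch gives estimates close to the intercept, slope, and WRMS reported there.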

Presentation  of  diagnostics  and  results  in  Epi  Meta 
includes  both  numerical  and  graphical  displays  and  output. 
Many  primary  and  secondary  level  decisions  are  made 
here  to  help  determine  both  the  manner  in  which 
unsophisticated  users  are  led  to  appropriate  conclusions 
and  the  amount  of  information  available  to  more 
sophisticated  users  to  evaluate  and  choose  among  results. 


An  example  of  the  first  screen  of  numerical  output  in  Epi 
Meta  for  the  within-study  analysis  is  presented  in 
Figure  4. 


EPI-Meta

Model Type: Estimated Intercept, ln(Relative Risk), Dose/Exposure Level
Study Name: EXAMPLE STUDY 1    Study Number: 1

Number of Dose Levels: 6    Dose Centering Value: 0.0000 (no centering)

                        Estimate(s)    SE(Estimate(s))
Estimated Intercept:    -0.21262682    0.07713190
Slope:                   0.00230901    0.00066787

Calculated WRMS: 0.6601    df: 4    p-value: 0.6197

Theoretical WRMS of 1.0000 with infinite df is used in the analysis since
the calculated WRMS offers no evidence of extra variability.

STUDY SUMMARY

                                                      Studentized
Dose Level    Relative Risk    SE (Relative Risk)    Residuals

 60.0000      0.8900000        0.0507                -1.5119
 90.0000      1.0900000        0.1023                 1.0961
144.0000      1.2800000        0.1879                 0.9313
204.0000      1.2400000        0.2193                -0.2789
258.0000      1.4700000        0.3519                 0.0104
403.0000      1.7000000        0.5501                -0.7608


Note: ** after studentized residual indicates it is a possible outlier


Figure  4.  Within-Study  Output 


The  first  few  lines  of  the  output  are  devoted  to 
summarizing  the  user-specified  input  decisions  made 
earlier.  Next,  a  summary  of  the  within  study  analysis, 
the  parameter  estimates  and  the  associated  standard 
errors,  is  listed.  Immediately  following  the  estimates  is 
the  goodness-of-fit  result  discussed  above.  The  calculated 
WRMS  and  degrees  of  freedom  are  listed  as  well  as  the 
associated  p-value  for  the  chi-square  statistic  which  is 
used  as  a  test  of  heterogeneity.  If  the  p-value  is  not 
significant  at  the  5  percent  level,  a  note  is  placed  on  the 
next  line  letting  the  user  know  that  an  internal  decision 
has  been  made  to  use  the  theoretical  WRMS  of  1  with 
infinite  degrees  of  freedom.  This  message  alerts  the  user 
to  a  statistical  methodology  decision  that  may  not  have 
been  anticipated,  while  still  maintaining  internal  control  of 
the  analysis. 
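The WRMS decision described above can be sketched in a few lines. The chi-square statistic is taken to be WRMS times its degrees of freedom, and the rule that the calculated WRMS is used when the test is significant is an assumption inferred from the text. The helper `chi2_sf` is a small stand-alone upper-tail routine written for this example, not part of any library.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1.

    Uses the recurrence Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1)
    for the regularized upper incomplete gamma function, with z = x / 2.
    """
    z = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))
    while a < df / 2.0:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q


def wrms_for_analysis(calculated_wrms, df, alpha=0.05):
    """Return (WRMS used in the analysis, p-value of the heterogeneity test)."""
    p_value = chi2_sf(calculated_wrms * df, df)
    if p_value >= alpha:
        return 1.0, p_value          # theoretical WRMS, no extra variability
    return calculated_wrms, p_value  # assumption: calculated WRMS is kept


print(wrms_for_analysis(0.6601, 4))  # p-value close to the 0.6197 shown in Figure 4
```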

The  last  half  of  the  page  provides  a  summary  of 
the  study,  listing  for  each  dose  level,  the  response 
variable,  the  standard  error  of  the  response,  and  the 
studentized  residual.  An  outlier  test  is  performed  on  the 
studentized  residuals,  flagging  possible  outliers  with  a 
double  asterisk.  Here  the  analyst  can  use  either  the 
double  asterisk  as  an  indicator  of  a  possible  outlier  or 
examine  the  actual  studentized  residuals. 
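As an illustration of the residual column, the sketch below computes internally studentized residuals for the weighted straight-line fit with an estimated intercept. The leverage-based formula and the weights (RR/SE(RR))^2 are assumptions, not a documented description of Epi Meta, and the function name is hypothetical. The outlier test that sets the double asterisk is not specified in the text, so no flagging rule is included.

```python
import math


def studentized_residuals(doses, rrs, se_rrs, wrms=1.0):
    """Studentized residuals of a weighted fit of ln(RR) on dose."""
    y = [math.log(r) for r in rrs]
    w = [(r / s) ** 2 for r, s in zip(rrs, se_rrs)]
    sw = sum(w)
    xbar = sum(wi * x for wi, x in zip(w, doses)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (x - xbar) ** 2 for wi, x in zip(w, doses))
    sxy = sum(wi * (x - xbar) * (yi - ybar) for wi, x, yi in zip(w, doses, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    result = []
    for wi, x, yi in zip(w, doses, y):
        leverage = wi * (1.0 / sw + (x - xbar) ** 2 / sxx)
        resid = yi - intercept - slope * x
        result.append(math.sqrt(wi) * resid / math.sqrt(wrms * (1.0 - leverage)))
    return result


doses = [60.0, 90.0, 144.0, 204.0, 258.0, 403.0]
rrs = [0.89, 1.09, 1.28, 1.24, 1.47, 1.70]
se_rrs = [0.0507, 0.1023, 0.1879, 0.2193, 0.3519, 0.5501]
print([round(t, 4) for t in studentized_residuals(doses, rrs, se_rrs)])
```

With the example study's data and a WRMS of 1, the values come out close to the residual column of Figure 4.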

Graphical  diagnostics  are  provided  through  three 
types  of  graphs:  1)  a  normal  probability  plot  of  the 
studentized  residuals,  2)  a  plot  of  the  studentized  residuals 
versus  the  dose  levels,  and  3)  a  plot  of  the  estimated  line 
and  the  observed  relative  risks,  appropriately  transformed. 
These  graphs  were  chosen  because  they  help  the  user 
visually  assess  in  a  simple  and  straightforward  manner  1) 
the  normality  of  the  data;  2)  possible  outliers;  and  3)  the 
fit  of  the  model  to  the  data.  An  example  of  the  type  of 
diagnostic  graphical  display  available  is  illustrated  in 
Figure  5. 
STUDY: EXAMPLE STUDY 1

ln(Relative Risk) vs. Dose/Exposure Level


Figure  5.  Within-Study  Diagnostic  Graph 


Among-Study  Analysis 

All  the  features  and  statistical  analysis  of  Epi  Meta 
discussed  so  far  were  designed  to  provide  the  input 
information  and  capabilities  required  to  conduct  the 
among-study  meta-analysis.  The  input  for  the  among- 
study  analysis  is  either  the  file  created  by  Epi  Meta  in  the 
Within-Study  Analysis  or  the  Single  Dose  Level  Per  Study 
data  file  created  using  the  data  management  system, 
where  both  files  contain  the  point  estimate(s)  and  their 
associated  standard  error(s)  for  each  study  which  will  be 
used  in  the  meta-analysis. 

A  slightly  different  approach  was  taken  for 
handling  the  statistical  programming  decisions  in  the 
among-study  meta-analysis  as  compared  to  the  within- 
study  analysis.  The  user  is  given  no  control  over  the  type 
of  analysis  to  be  conducted.  The  only  program  choice 
made  available  to  the  user  is  the  choice  of  the  data  file  on 
which  to  run  the  analysis.  Rather  than  allow  the  user  to 
choose  certain  analysis  options,  such  as  a  fixed  versus 
random  effects  model,  or  an  individual  versus  joint 
parameter  analysis,  the  decision  was  made  to  present  all 
results  for  several  selected  analyses,  along  with  certain 
warnings  and  guidance.  Two  levels  of  output  reports  are 
offered:  a  "Complete  Analysis  Report"  and  a  "Summary 
Only"  report.  Therefore,  in  a  similar  fashion  to  the 
within-study  analysis,  the  user  is  given  options  concerning 
which  analysis  to  use;  however,  the  results  for  all  options 
are  always  presented.  In  a  like  manner,  a  "default" 
analysis  is  available  in  the  form  of  the  summary  report, 
where  the  summary  results  presented  are  offered  as  the 
default  "answer". 


As  in  the  within-study  analysis,  the  most  difficult 
statistical  design  decisions  occurred  in  determining  which 
results  should  be  presented,  in  what  manner,  and  with 
what  degree  of  interpretation  and  guidance.  Again  both 
numerical  and  graphical  displays  and  output  were 
provided. 

An  example  of  the  "Complete  Analysis  Report", 
which  presents  all  analysis  results  and  graphs  for 
individual  and  joint  parameter  analysis,  is  provided  in 
Figure  6. 

The  first  three  lines  of  the  "Complete  Analysis 
Report"  provide  general  information  concerning  the  model 
on  which  the  individual  study  parameter  estimates  are 
based  and  the  file  used  for  the  analysis.  This  kind  of 
simple  user-friendliness  from  a  computing  standpoint  also 
has  benefits  from  a  statistical  standpoint,  making  for 
easier  iteration  and  comparison  of  analysis.  The  first  set 
of  results  presented  is  the  individual  parameter  analysis 
(slope or slope and intercept). The F-value or chi-square
statistic, degrees of freedom, and the associated p-value
for the test of homogeneity of the parameter across
studies are listed. If significant heterogeneity is indicated,
an asterisk (*) is placed after the p-value, and a warning on
the line immediately after the test of homogeneity tells the
user that the fixed effects model estimates are judged
inappropriate because of significant heterogeneity. The flag
and warning strike a compromise between refusing to present
results that the software developer judges questionable and
presenting the results with no guidance or warning when
inferences based on the analysis might be inappropriate.
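
The flag-and-warn behaviour described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the closed-form p-value is valid only for a chi-square statistic with exactly 2 degrees of freedom, and nothing here is taken from the Epi Meta source code.

```python
import math

def chi2_sf_df2(q):
    """Upper-tail p-value of a chi-square statistic with exactly 2 df."""
    return math.exp(-q / 2.0)

def homogeneity_line(q, alpha=0.05):
    """Format a df = 2 homogeneity test; add '*' and a warning when p < alpha."""
    p = chi2_sf_df2(q)
    flag = "*" if p < alpha else ""
    lines = [f"Chi-Sq Value: {q:.6f}  df: 2  p-value: {p:.6f}{flag}"]
    if flag:
        # Wording modelled on the warning line shown in Figure 6.
        lines.append("* WARNING: Fixed Effects Model Inappropriate -- "
                     "Significant Heterogeneity")
    return lines

# Chi-square statistic for the intercepts, as printed in Figure 6.
for line in homogeneity_line(3.670825):
    print(line)
```

For the intercept statistic printed in Figure 6 (3.670825), the computed p-value is about 0.1595, which agrees with the printed p-value, so no flag is raised.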

Immediately  following  the  test  of  homogeneity  of 
the  parameter  is  the  fixed  effects  model  analysis.  The 
combined  parameter  estimate  and  standard  error  of  this 
estimate  are  listed  along  with  95%  confidence  intervals. 
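
As a small worked check, the fixed effects interval for the combined intercept in Figure 6 is consistent with the usual normal-theory form, estimate plus or minus 1.96 standard errors. The sketch below assumes that form; the function name is hypothetical and the multiplier 1.96 is an assumption inferred from the printed numbers, not a documented Epi Meta setting.

```python
def normal_ci(estimate, se, z=1.96):
    """Two-sided interval estimate +/- z * se (z = 1.96 assumed for 95%)."""
    half_width = z * se
    return estimate - half_width, estimate + half_width

# Combined intercept and its standard error, fixed effects model, Figure 6.
low, high = normal_ci(-0.16563291, 0.07052021)
print(f"({low:.8f}, {high:.8f})")
```

The result agrees with the interval printed in Figure 6, (-0.30385252, -0.02741330), to the precision shown. The random effects intervals in Figure 6 are wider than a 1.96 multiplier would give, which is consistent with the degrees-of-freedom choices that the text assigns to the technical appendix.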

An outlier analysis is presented, listing both the
studentized residual and an outlier indicator, similar to
that found in the within-study analysis. The final line of
the  fixed  effects  model  analysis  always  reminds  the  user 
of  the  assumptions  on  which  the  model  is  based,  i.e.  the 
validity  of  the  fixed  effects  model  estimates  is  based  on 
the  assumption  that  all  study  parameters  are 
homogeneous.  The  output  for  the  random  effects  model 
for  the  individual  parameter  estimates  is  similarly 
displayed  if  the  among-study  variance  component  estimate 
is  positive.  If  this  estimate  is  not  positive,  then  only  the 
among-study  variance  component  estimate  is  listed  along 
with  a  warning  that  the  random  effects  analysis  is  not 
estimable. 
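
The outlier indicator can be sketched as a simple threshold rule. The cut-offs of 2 and 3 follow the legend in Figure 6; applying them to the absolute value of the residual is an assumption, and the function name is hypothetical.

```python
def outlier_indicator(studentized_residual):
    """Return '***' above 3, '**' above 2, otherwise an empty string.

    Assumption: the thresholds apply to the absolute value of the residual.
    """
    magnitude = abs(studentized_residual)
    if magnitude > 3:
        return "***"
    if magnitude > 2:
        return "**"
    return ""

# Random effects intercept residuals as printed in Figure 6.
residuals = {1: -0.96025834, 2: -0.30627719, 3: 1.37299328}
for study, value in residuals.items():
    print(f"{study}  {value: .8f}  {outlier_indicator(value)}")
```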


274  Meta-Analytic  Statistical  Software 


AMONG-STUDY ANALYSIS

MODEL TYPE: Estimated Intercept, ln(RR), Dose/Exposure
FILE:       C:\EPIMETA\EXAMPLES\EXAMPLE.REC (DEFAULT.MTA)


INDIVIDUAL PARAMETER ANALYSIS

ANALYSIS OF INTERCEPTS

Test of Homogeneity of Intercepts
  Chi-Sq Value: 3.670825   df: 2   p-value: 0.159548

Fixed Effects Model
  Combined Intercept     SE(Combined Intercept)
  -0.16563291            0.07052021

  Combined Intercept 95% Confidence Interval
  ( -0.30385252, -0.02741330)

  Study Number   Studentized Residual   Outlier Indicator
  1              [illegible]
  2              -0.07031039
  3               1.91348599

  NOTE: Fixed Effects estimate validity based on the assumption all
        study intercepts are homogeneous

Random Effects Model
  Combined Intercept     SE(Combined Intercept)
  -0.09429109            0.14035810

  Combined Intercept 95% Confidence Interval
  ( -0.69372430,  0.50514212)

  Among Study Variance Component
  0.02893749

  Study Number   Studentized Residual   Outlier Indicator
  1              -0.96025834
  2              -0.30627719
  3               1.37299328

  '**'  indicates the Studentized Residual greater than 2
  '***' indicates the Studentized Residual greater than 3

ANALYSIS OF SLOPES

Test of Homogeneity of Slopes
  Chi-Sq Value: 22.42344309   df: 2   p-value: 0.000013*
  * WARNING: Fixed Effects Model Inappropriate -- Significant Heterogeneity

Fixed Effects Model
  Combined Slope         SE(Combined Slope)
  0.00299949             0.00064700

  Combined Slope 95% Confidence Interval
  (  0.00173137,  0.00426762)

  Study Number   Studentized Residual   Outlier Indicator
  1              -4.16887527
  2               1.12484845
  3               4.56416075            ***

  NOTE: Fixed Effects estimate validity based on the assumption all
        study slopes are homogeneous

Random Effects Model
  Combined Slope         SE(Combined Slope)
  0.00915249             0.00521883

  Combined Slope 95% Confidence Interval
  ( -0.01313578,  0.03144075)

  Among Study Variance Component
  0.00007295

  Study Number   Studentized Residual   Outlier Indicator
  1              -1.00730318
  2              -0.24616392
  3               1.28972219

  '**'  indicates the Studentized Residual greater than 2
  '***' indicates the Studentized Residual greater than 3


JOINT ANALYSIS

Test of Homogeneity of Slopes and Intercepts Jointly
  F-Value: 83.098032   df: 4,27   p-value: 0.000000*
  * WARNING: Fixed Effects Model Inappropriate -- Significant Heterogeneity

Fixed Effects Model
              Combined Estimates   Cov(Combined Estimates)
  Intercept    0.40652180           0.00393325  -0.00003200
  Slope       -0.00009200          -0.00003200   0.00000038

  Individual 95% Confidence Intervals for Combined Estimates
  Est. Intercept  (  0.27783740,  0.53520620)
  Slope           ( -0.00136209,  0.00117809)

  Study Number   Quadratic Form   Outlier Indicator
  1              277.82533287     ***
  2                4.76512035
  3              326.01328099     ***

  NOTE: Fixed Effects estimate validity based on assumption of
        homogeneity of straight lines across all studies included
        in the analysis.

Random Effects Model
              Combined Estimates   Cov(Combined Estimates)
  Intercept   -0.06646357           0.01114417   0.00032883
  Slope        0.00984438           0.00032883   0.00002246

  Individual 95% Confidence Intervals for Combined Estimates
  Est. Intercept  ( -0.51730860,  [illegible])
  Slope           ( -0.01039752,  0.03008629)

  Among Study Variance-Covariance
  0.01958782   0.00112293
  0.00112293   0.00006438

  Study Number   Quadratic Form   Outlier Indicator
  1              1.48733859
  2              0.29681320
  3              3.08044045

  '**'  indicates the Quadratic Form greater than Chi-Square(df = 2, 0.95)
  '***' indicates the Quadratic Form greater than Chi-Square(df = 2, 0.9975)

Figure  6.  Among-Study  Output 


The  results  of  the  joint  parameter  analysis  are 
printed  immediately  following  the  individual  parameter 
analysis.  Included  in  this  analysis  is  an  F-value  or  chi- 
square  statistic  for  a  joint  test  of  homogeneity  of  the  slope 
and  intercept  along  with  the  associated  degrees  of  freedom 
and  p-value.  Similar  to  the  individual  analysis,  if  the  p- 
value  is  less  than  0.05  an  asterisk  is  printed  after  the  p- 
value  and  a  warning  printed  immediately  following  the 
results  cautioning  the  user  that  the  fixed  effects  joint 
estimates  may  be  inappropriate. 

The  fixed  effects  combined  weighted  joint  estimate 
of  the  intercept  and  slope,  the  variance-covariance  matrix 
of  the  joint  estimates,  the  95%  confidence  limits  for  both 
the  intercept  and  slope,  and  the  joint  studentized  residuals, 
flagged  if  the  study  is  a  possible  outlier,  are  all  listed. 

The  random  effects  model  results  for  the  joint 
analysis  are  presented  next  if,  analogous  to  the  random 
effects  individual  parameter  analysis,  the  among-study 
variance-covariance  matrix  estimate  does  not  have  all 
elements  equal  to  zero.  If  the  among-study  variance- 
covariance  matrix  estimate  does  have  all  elements  equal  to 
zero,  then  the  random  effects  estimates  are  not  presented 
and  an  appropriate  warning  is  listed.  Otherwise,  output 
similar  to  the  fixed  effects  estimates  joint  parameter 
analysis  is  listed. 

The  last  part  of  the  output,  not  shown  in  Figure  6, 
presents  a  summary  of  the  data  used  in  the  meta-analysis. 
This  allows  the  user  to  store  the  actual  data  with  the 
results  for  future  reference. 

Note  that  Figure  6  shows  warnings  in  two  places 
that  the  fixed  effects  model  is  inappropriate.  The  results 
obtained  from  the  fixed  effects  individual  parameter 
analyses  are  inconsistent  with  those  from  the  fixed  effects 
joint  parameter  analyses.  This  inconsistency  does  not 
occur  for  the  random  effects  analyses.  Thus  the 
unsuspecting  user  is  warned  of  impending  pitfalls. 

As shown, the "Complete Analysis Report"
presents the user with both fixed and random effects
model  based  inferences  for  both  an  individual  and  joint 
parameter  analysis,  with  certain  flags,  warnings  and 
guidance.  Many  primary  level  decisions  concerning  the 
statistical  methodology  (where  different  reasonable  options 
were  possible)  made  during  the  among-study  analysis  are 
explained  to  the  user  only  in  the  technical  appendix  to  the 
user documentation. These include, for the individual
parameter estimates: the estimate of degrees of freedom
for the fixed effects standard error, the joint analysis test
of heterogeneity, the degrees of freedom for the test of
"significant" studentized residuals, the method of
estimating the random effects variance component, and
the estimate of the degrees of freedom for the random
effects standard error. For the joint parameter estimates
they include: the degrees of freedom for the fixed effects
variance-covariance matrix, the estimate of the among-study
variance-covariance matrix, the degrees of freedom
associated with the among-study variance-covariance
matrix, the estimate of joint residuals, and the estimate of
joint confidence levels on predicted values. As mentioned
for the within-study analysis, decisions that are
functionally transparent to the user are usually embedded
in the software because either 1) the chosen methodology
is determined to be most appropriate; or 2) the different
options that could be made available would have a minor
effect on the calculated results. In Epi Meta, decisions
concerning  appropriate  methodology  were  based  on  a 
prior  literature  review  of  methodology  and  applications 
papers  and  represent  the  current  recommended 
methodology.  However,  there  are  some  cases  where 
evolving  methodology  must  be  incorporated  into  a 
statistical program. One methodology choice in Epi Meta
that is transparent to most users (those who do not read the
technical documentation) is the ad hoc method of
estimating the random effects among-study variance-
covariance matrix in the joint parameter analysis. As
discussed  in  the  technical  documentation,  the  among-study 
variance-covariance  matrix  is  not  invariant  to  the  choice 
of  a  value  on  which  to  center  the  dose/exposure  levels. 
Therefore  alternative  choices  of  a  centering  constant  may 
result  in  differing  values  of  the  joint  analysis  among-study 
variance-covariance  matrix.  In  general,  the  more 
customized  the  statistical  application,  and  the  more 
advanced  the  methodology,  the  more  difficult  will  be  the 
choices  concerning  which  decisions  should  and  should  not 
be  embedded  in  the  software. 
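
One elementary reason why an intercept and slope variance-covariance matrix depends on the centering constant can be shown numerically. If the dose levels are shifted by a constant c, the intercept a becomes a + c*b while the slope b is unchanged, so the variance of the intercept and its covariance with the slope both change. The sketch below applies this general linear-transformation rule to the among-study matrix printed in Figure 6. The function name and the shift of 10 dose units are hypothetical, and the code does not reproduce Epi Meta's ad hoc estimator.

```python
def recenter_covariance(var_a, cov_ab, var_b, c):
    """Covariance terms of (a + c*b, b) given those of (a, b).

    var_a, var_b: variances of intercept and slope; cov_ab: their covariance.
    c: shift applied to the dose/exposure origin (hypothetical here).
    """
    new_var_a = var_a + 2.0 * c * cov_ab + c * c * var_b
    new_cov_ab = cov_ab + c * var_b
    return new_var_a, new_cov_ab, var_b

# Among-study variance-covariance matrix from Figure 6.
original = (0.01958782, 0.00112293, 0.00006438)
print(recenter_covariance(*original, c=0.0))   # no shift: matrix unchanged
print(recenter_covariance(*original, c=10.0))  # shifted origin: new intercept terms
```

Only the slope variance is unaffected by the shift, which is why any report of the full matrix has to be read together with the centering value that was used.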

The  graphical  displays  and  diagnostics  associated 
with  the  complete  analysis  report  include,  for  each  model 
and  each  parameter,  the  following  graphs:  1)  a  normal 
probability  plot  of  studentized  residuals;  2)  a  plot  of 
studentized  residuals  versus  the  study  number;  3)  a  plot, 
for  each  parameter  (slope  and  intercept),  of  the  estimated 
parameter  versus  the  study  number;  4)  a  plot  of  the 
quadratic  forms  of  the  joint  residuals  versus  the  study 
number  for  each  model;  and  5)  a  plot,  for  each  individual 
parameter  analysis,  of  the  parameter  estimates  for  each 
study  with  associated  95%  confidence  bounds  versus  the 
overall  estimate  and  its  95%  confidence  limits.  An 
example  of  the  diagnostic  plot  of  the  quadratic  forms  of 
the  joint  residuals  versus  study  number  is  provided  in 
Figure  7  below. 


[Plot: Fixed Effects Model, Joint Residuals (QF) vs. Study Number; horizontal axis: Study Number]


Figure  7.  Among-Study  Diagnostic  Graph 


An  example  of  the  plot  of  the  overall  meta-analysis 
estimate  and  associated  confidence  intervals  along  with  the 
individual  study  results  is  provided  in  Figure  8. 


[Plot: Individual Parameter Analysis, Slope Estimates with 95% Confidence Bounds; horizontal axis: Predicted Slope]


Figure  8.  Among-Study  Summary  Graph 

The "Summary Only" report provides a
compressed summary of the results of the analysis,
giving a "default" best estimate of the combined
weighted slope or combined weighted slope and intercept.
This  report  is  for  the  user  who  does  not  wish  to  choose 
between  the  alternative  analyses  presented  in  the  Complete 
Analysis  Report.  The  decision  on  which  analysis  to 
present  is  based  on  a  decision  tree  determined  by  the 
software  developers.  If  the  fixed  intercept  model  was 
used to combine the data within each study, then the
random effects model estimates are given provided the
variance  component  is  positive,  otherwise  the  fixed  effects 
model  estimates  are  given.  When  the  estimated  intercept 
model  is  used  to  combine  the  information  within  each 
study  and  the  among-study  variance-covariance  matrix 
does  not  have  all  elements  equal  to  zero,  the  joint  analysis 
random  effects  model  estimates  are  presented,  otherwise 
the  joint  analysis  fixed  effects  model  estimates  are  given. 
An  example  of  the  Summary  Report  output  is  presented  in 
Figure  9  below. 
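
The decision tree described above can be written out as a short function. This is a sketch under stated assumptions: the function name, the string labels for the within-study models, and the returned labels are all hypothetical, and only the two branching rules come from the text.

```python
def choose_summary_analysis(within_model, variance_component=None, cov_matrix=None):
    """Pick the analysis shown in the "Summary Only" report.

    within_model: "fixed_intercept" or "estimated_intercept" (hypothetical labels).
    variance_component: among-study variance estimate (fixed intercept model).
    cov_matrix: among-study variance-covariance matrix as nested lists
                (estimated intercept model).
    """
    if within_model == "fixed_intercept":
        if variance_component is not None and variance_component > 0:
            return "random effects"
        return "fixed effects"
    if within_model == "estimated_intercept":
        has_nonzero = cov_matrix is not None and any(
            value != 0 for row in cov_matrix for value in row
        )
        return "joint random effects" if has_nonzero else "joint fixed effects"
    raise ValueError(f"unknown within-study model: {within_model!r}")

# Among-study variance-covariance matrix from Figure 6.
figure6_cov = [[0.01958782, 0.00112293], [0.00112293, 0.00006438]]
print(choose_summary_analysis("estimated_intercept", cov_matrix=figure6_cov))
```

With the matrix from Figure 6 the function selects the joint random effects analysis, which is the analysis reported in Figure 9.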


SUMMARY OF AMONG-STUDY ANALYSIS

MODEL TYPE: Estimated Intercept, ln(Relative Risk), Dose/Exposure
FILE:       C:\EPIMETA\EXAMPLES\EXAMPLE.REC (DEFAULT.MTA)

JOINT PARAMETER ANALYSIS

Random Effects Model
              Combined Estimate (SE)    95% Confidence Intervals
  Intercept   -0.0665 ( 0.1056)         [illegible]
  [remaining rows illegible; surviving fragment:]
  a   0.2390 ( 0.2229)   0.0190 ( 0.0036)


Figure 9. "Summary Only" Report


Predicted Relative Risk with 95% Confidence Bounds
at User-Specified Dose Levels

Original Data File: C:\EPIMETA\EXAMPLES\EXAMPLE.REC

Within Study Model:
  Estimated Intercept, ln(Relative Risk), Dose/Exposure Level
Among Study Model:
  Random Effects: Intercept = -0.0665   Slope = 0.0098

User-Specified   Predicted   95% Lower and Upper
  [rows illegible; surviving upper bounds: 1.7937 and 4.2095]

ESC to Quit


Figure 10. Results of User-Specified Predictions

Similar  to  the  Complete  Analysis  Report  the  first 
three  lines  of  the  numerical  Summary  Report  provide  a 
summary  of  the  model  used  to  combine  the  data  within 
each  study  and  the  file  from  which  the  data  came.  Next, 
the  default  best  estimates  and  the  standard  errors  of  the 
estimates  are  presented  in  accordance  with  the  above 
described  rules.  Finally,  a  brief  summary  of  the  actual 
data  used  in  the  meta-analysis  is  given. 

Only  one  type  of  graph  is  provided  in  the 
Summary  Report.  For  each  study,  the  parameter 
estimates  (slope  and  intercept)  with  the  associated  95% 
confidence  bounds  are  displayed  in  comparison  with  the 
overall  default  best  model  parameter  estimate  and 
associated  95%  confidence  interval  (See  Figure  8  above). 

A  final  among-study  analysis  output  option  allows 
the  interested  user  to  generate  predicted  relative  risks  and 
associated  95%  confidence  bounds  for  various  dose  levels. 
The  model  used  to  generate  these  estimates  is  determined 
using  the  same  decision  tree  discussed  above.  The  user 
can  enter  up  to  five  dose  levels  and  is  warned  if  any  of 
the  dose  levels  are  outside  the  range  of  the  model. 
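
A prediction step of this kind might look like the following sketch. It assumes the log-linear model named in the report headers, ln(Relative Risk) = intercept + slope * dose, a normal-theory 95% interval built from the variance-covariance matrix of the combined estimates, and a user-supplied dose range for the warning. The function names, the 1.96 multiplier, the example doses and the dose range are all hypothetical; the way Epi Meta actually forms confidence levels on predicted values is described only in its technical appendix.

```python
import math

def predict_relative_risk(dose, intercept, slope, cov, dose_range, z=1.96):
    """Predicted relative risk with approximate bounds at one dose level.

    cov: (var_intercept, cov_intercept_slope, var_slope) of the combined estimates.
    dose_range: (lowest, highest) dose covered by the model (assumed input).
    """
    var_a, cov_ab, var_b = cov
    log_rr = intercept + slope * dose
    se = math.sqrt(var_a + 2.0 * dose * cov_ab + dose * dose * var_b)
    lowest, highest = dose_range
    return {
        "dose": dose,
        "rr": math.exp(log_rr),
        "lower": math.exp(log_rr - z * se),
        "upper": math.exp(log_rr + z * se),
        "outside_range": not (lowest <= dose <= highest),
    }

def predict_many(doses, **kwargs):
    """Apply the prediction to at most five dose levels, as in the text."""
    if len(doses) > 5:
        raise ValueError("at most five dose levels can be entered")
    return [predict_relative_risk(d, **kwargs) for d in doses]

# Joint random effects estimates from Figure 6; doses and range are made up.
rows = predict_many(
    [0.0, 50.0, 200.0],
    intercept=-0.06646357,
    slope=0.00984438,
    cov=(0.01114417, 0.00032883, 0.00002246),
    dose_range=(0.0, 100.0),
)
for row in rows:
    note = "  (warning: outside model range)" if row["outside_range"] else ""
    print(f"{row['dose']:7.1f}  {row['rr']:.4f}  "
          f"({row['lower']:.4f}, {row['upper']:.4f}){note}")
```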

Figure  10  illustrates  the  output  provided  when 
user-specified  predictions  have  been  chosen. 

The first half of the output page summarizes both
the within and among-study analyses so that the user is
aware of the analysis performed to generate the estimates.


Abstract


The  concept  of  statistical  strategy  is  introduced  and  used 
to  develop  a  structured  graphical  user  interface  for  guided 
data  analysis.  The  interface  visually  represents  statistical 
strategies  that  are  designed  by  expert  data  analysts  to 
guide  novices.  The  representation  is  an  abstraction  of  the 
expert’s  concepts  of  the  essence  of  a  data  analysis. 

The interface consists of two interacting windows: the
guidemap and the workmap. Each window contains a graph
which has nodes and edges. The guidemap graph represents
the statistical strategy for a specific statistical task
(such as describing data). Nodes represent potential
data-analysis actions that can be taken by the system. Edges
represent potential actions that can be taken by the analyst.
The guidemap graph exists prior to the data-analysis session,
having been created by an expert. The workmap
graph represents the complete history of all steps taken by
the data analyst. It is constructed during the data-analysis
session as a result of the analyst’s actions. Workmap nodes
represent datasets, data models, or data-analysis procedures
which have been created or used by the analyst.
Workmap edges represent the chronological sequence of
the analyst’s actions. One workmap node is highlighted to
indicate which statistical object is the focus of the strategy.


1.0  Motivation 


Data are the lifeblood of science. Because computerized
data-analysis systems help scientists understand data, they
have become of central importance to the scientific enterprise,
evolving into extensive and powerful systems capable
of performing many kinds of very sophisticated and
complex analyses.

Unfortunately, the structure of data-analysis systems has
evolved willy-nilly over the years. While much thought
has been focused on the kinds of analyses that can be performed
by these systems, less thought has been given to
their overall structure: It seems that the more powerful a
statistical system is, the more clumsy it is to use.

In all statistical systems that we are familiar with, even
when simple data-analysis procedures are used, novice
users are soon at a loss as to how to combine several
data-analysis procedures into a cogent statistical strategy that
reveals the basic information in the data. The very power
of many systems can actually hinder the data-analysis
task, especially for users who are novices. We have the
paradoxical situation that the increasingly powerful and
sophisticated data-analysis systems are actually less suited
to helping most users understand data.

In this paper we propose that data-analysis environments
should support the visualization of statistical strategies
and structures. We present an environment that guides the
data-analysis steps taken by novice data analysts. Our
environment also aids data analysts at all levels of sophistication
by showing them the structure of their analysis
session. In addition, sophisticated users can perform analyses
simply by typing commands, if they don’t want to use
the graphical interface. Finally, our environment includes
graphical tools that can be used by expert data analysts to
create the analysis strategies that are used to guide novice
analysts.

2.0 Background


We hold that data analysis is a highly complex activity
(Young & Smith, 1991) that involves actions that are
repeated over and over again. Thus, data analysis is a
repetitive, cyclical search for understanding (Lubinsky &
Pregibon, 1988). We believe that data analysis productivity,
accuracy, accessibility and satisfaction will improve in
an environment that guides and structures the actions that
occur during the search for meaning in data.

One  of  our  main  design  principles  is  that  a  data-analysis 
system  should  incorporate  a  variety  of  environments,  each 
suited  to  a  specific  level  of  data  analysis  sophistication 
that  a  user  might  have,  so  as  to  maximize  the  data  analyst’s 
productivity  and  satisfaction.  We  believe  that  data-analysis 
software  should  be  designed  to  accommodate  the  complete 
range  of  data  analyst  sophistication,  from  novice  to  expert. 

We identify four kinds of data analysts: novice, competent,
sophisticated and expert. Accordingly, we propose four
kinds of environments: First, there should be guidemaps to
guide novice data analysts through complete data analyses;
second, there should be workmaps to inform novice
and competent data analysts of the overall structure of
their data-analysis sessions; third, there should be command
lines to let sophisticated data analysts dispense with
the visual aids when they find them unnecessary; finally,
there should be an authoring mode to help expert data analysts
create the guidance diagrams that are used by novices.
In addition to these four environments, which are all
highly interactive, there should be a script environment for
automating repetitive data analyses. These five environments
should be seamlessly integrated within the statistical
analysis environment. Analysts should be able to switch
easily between them whenever desired, as we believe that
analysts do not have the same level of expertise for all
aspects of data analysis.


Structuring Data Analysis: Young & Smith (1991) argue
that the process of data analysis is improved when the
environment structures the actions taken by the data analyst.
They suggest that an on-going data analysis should be
represented by an icon-based graphical user interface
which constructs a map of the analysis as it proceeds. This
map shows the structure of the actions taken by the data
analyst, and the data, models and analysis procedures
involved in those actions. The map presents the analyst
with a visualization of the structure of the analysis session,
and can be used to return to previous steps.

For our work, the formal representation of session structure
is the workmap. Our definition is: A workmap is a
directed acyclic graph consisting of nodes and edges (as
suggested by Young & Smith, 1991), where a node represents
a data-analysis object (a dataset or a data model) or a
data-analysis procedure that has been used by the analyst,
and an edge represents the chronological sequence of the
objects and procedures (the creation dependencies) during
the analysis session. Taken as a whole, the workmap is a
visual, object-oriented, directly manipulable, structured
representation of the history of a data-analysis session.
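
The workmap definition can be made concrete with a small data structure. The sketch below is not the authors' implementation: the class and method names are hypothetical, each node is given at most one parent for simplicity, and the parent links between the example objects are guesses made for illustration. Because a node can only be attached to a node that already exists, the graph cannot contain a cycle.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

@dataclass
class WorkMap:
    """Session history as a directed acyclic graph (one parent per node here)."""
    kinds: Dict[str, str] = field(default_factory=dict)            # node -> kind
    parents: Dict[str, Optional[str]] = field(default_factory=dict)
    edges: List[Tuple[str, str]] = field(default_factory=list)      # creation order
    focus: Optional[str] = None                                     # highlighted node

    def add(self, name, kind, created_from=None):
        """Record a new dataset, model or procedure and its creation dependency."""
        if name in self.kinds:
            raise ValueError(f"node {name!r} already exists")
        if created_from is not None and created_from not in self.kinds:
            raise KeyError(f"unknown parent {created_from!r}")
        self.kinds[name] = kind
        self.parents[name] = created_from
        if created_from is not None:
            self.edges.append((created_from, name))

    def history(self, name):
        """Chain of nodes leading to `name`, oldest first."""
        path = []
        while name is not None:
            path.append(name)
            name = self.parents[name]
        return list(reversed(path))

workmap = WorkMap()
workmap.add("CarRatings", "dataset")
workmap.add("Norm-CarRatings", "dataset", created_from="CarRatings")
workmap.add("PCA-CarPrefs", "model", created_from="Norm-CarRatings")
workmap.focus = "PCA-CarPrefs"
print(workmap.history(workmap.focus))
```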

Notice that a node is a self-contained unit of existing data
(dataset), statistical computation (analysis procedure), or a
combination of the two (data model), whereas edges represent
the choices, actions and decisions that a data analyst
made during the session. Nodes, which are the basic building
blocks of the on-going data-analysis session, can be
selected and reviewed at any time. The workmap visualizes
the history of the on-going data analysis. It is a realization
of a specific statistical strategy.

Guiding Data Analysis: At each step of a data analysis
the data analyst is faced with many choices. Often, the
data analyst returns to previous steps in order to make different
choices. As stated by Lubinsky and Pregibon
(1988), “Like a detective, a data analyst will experience
many dead ends, retrace his steps, and explore many alternatives
before settling on a single description of the evidence
in front of him.” We argue that data analysis will
improve when it occurs in an environment that guides the
actions taken by the analyst to understand data.

We use the Artificial Intelligence (AI) notion of strategy as
a basis for developing methods for guiding data analysts.
Several statisticians have developed the notion of a statistical
strategy. These developments are extensively
reviewed by Gale, Hand & Kelly (1993). Our definition of
statistical strategy is: A statistical strategy is a formal
representation of an expert statistician’s conceptual structuring
of 1) the data-analysis procedures to accomplish a
specified data-analysis task; 2) the data analyst’s actions
(choices, decisions, etc.) that are possible with the procedures;
and 3) the relationships between the procedures and
actions needed to accomplish the task. The data-analysis
task is to understand a specified data-analysis object (a
dataset or data model).

For our data-analysis environment guidemaps are the formal
representation of statistical strategy. Our definition is:
A guidemap is a directed cyclic graph consisting of nodes
and edges. The nodes of the graph represent data-analysis
procedures, whereas the edges represent the analyst’s possible
actions. The structure of the map indicates the order
dependencies between the procedures and the actions that
can be taken with the procedures to accomplish the
data-analysis task of understanding the data-analysis object.
Finally, the data-analysis object (dataset or data model) is
represented by a highlighted node of the workmap. It is
said to be the focus object.

Notice that a guidemap node is a self-contained unit of
potential statistical computation, while a guidemap edge
represents the expert’s guidance about moving from one
computation to the next. Nodes are the basic building
blocks of potential data analyses, i.e., of statistical strategies.
On the other hand, the edges in the strategy represent
the data analyst’s possible choices, actions and decisions
regarding the use of data-analysis procedures. They indicate
permissible paths for traversing the nodes. Nodes can
only be selected when they are highlighted. As a whole,
the guidemap visualizes and abstracts the essence of an
expert’s statistical strategy.

3.0  Representing  Statistical  Strategy 

In this section we discuss our definition of statistical strategy
in detail, focusing on the four key aspects of the definition:
the formal representation; the data-analysis object
that is the focus of the strategy; the role of the expert
statistician; and the objects, procedures and actions.

3.1  The  Formal  Representation  of  Strategy 

First, our definition states that a statistical strategy is based
on a formal representation. Our formal representation consists
of graph structures like that shown in the guidemap
window of Figure 1. This figure is a screen image from
ViSta, the visual statistics research and development testbed
(Young, 1994) that implements the ideas in this paper.

The guidemap, titled Analysis Cycle, presents the overall
statistical strategy. This specific guidemap is always the
first guidemap for a newly created dataset object. It is only
a small portion of the overall strategy, since it causes additional
“sub”-guidemaps to be displayed in the window.
Taken as a whole, the guidemap in Figure 1, plus all of the
additional guidemaps, is our formal representation of
statistical strategy.

Figure 1: Formal Representation of Statistical Strategy in the WorkMap and GuideMap

The strategy concerns a specific data or model object;
thus, a data or model object is the focus of the analysis.
The focus object is represented in the workmap window
by the highlighted (dark) icon. The workmap itself shows
where this object fits into the structure of the overall
on-going analysis. The two separate windows emphasize the
separation between the on-going data analysis (mapped in
the workmap) and the strategy that is guiding the data
analysis (mapped in the guidemap). We discuss the workmap
in the next subsection. Here, we discuss the
guidemap.

As  stated  above,  the  guidemap  is  a  directed  (possibly) 
cyclic  graph  consisting  of  edges  and  nodes.  In  our  work, 
guidemap  nodes  are  represented  by  the  rectangular  button 
icons,  and  guidemap  edges  are  represented  by  the  arrows. 
Thus,  the  buttons  show  potential  steps  in  the  analysis  that 
the  analyst  is  guided  to  take,  whereas  the  arrows  indicate 
the  flow  of  guidance  from  one  step  to  the  next.  A  node  is  a 
self-contained  unit  of  potential  statistical  computation 
which  may  do  its  own  computations,  or,  recursively,  call 
another  strategy. 

Buttons can be “active” or “inactive”. Active buttons are
highlighted (such as the Link:Explore button in Figure 1)
and are ready to cause an action. Clicking on the ?? side of
an active button enters a hypertext which causes help to be
displayed about the action of the button. Clicking on the !!
side of an active button enters a hypercode which causes
the button’s action to be initiated. Once the button’s action
has taken place, the highlighting (activation) of the buttons
changes: The clicked button deactivates, and the buttons
that it points to are activated. Inactive buttons (such
as the Link:Transform button in Figure 1) are not ready
to do anything: Clicking on them has no effect.
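
The activation rule for buttons can be sketched as a tiny state machine over the guidemap graph. The class and method names are hypothetical. The edges from Link:Explore to Link:Transform and Link:Analyze follow the behaviour described for Figure 1, while the edge from Link:Transform back to Link:Explore is invented here only to show that a guidemap, unlike a workmap, may contain a cycle.

```python
class GuideMap:
    """Directed (possibly cyclic) graph of buttons with an active set."""

    def __init__(self, edges, start):
        self.successors = {}
        for source, target in edges:
            self.successors.setdefault(source, []).append(target)
            self.successors.setdefault(target, [])
        self.active = {start}

    def click(self, button):
        """Run a button's action if it is active; return whether anything happened."""
        if button not in self.active:
            return False                      # inactive buttons ignore clicks
        self.active.discard(button)           # the clicked button deactivates
        self.active.update(self.successors[button])  # its successors activate
        return True

guide = GuideMap(
    edges=[
        ("Link:Explore", "Link:Transform"),
        ("Link:Explore", "Link:Analyze"),
        ("Link:Transform", "Link:Explore"),   # hypothetical edge forming a cycle
    ],
    start="Link:Explore",
)
print(guide.click("Link:Transform"))  # inactive at the start
print(guide.click("Link:Explore"))
print(sorted(guide.active))
```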

There are two kinds of buttons: Flow Buttons, which control
the flow between various portions of the large structure
of guidemaps, and Procedure Buttons, which control
the use of data-analysis procedures.

Figure 2: Formal Representation of Strategy for Exploring Data

Flow buttons include the Link and GoTo buttons in Figure
1, and the Return button in Figure 2. These buttons take
the user to other guidemaps. The Link button takes the
analyst to a new strategy, whereas the Return button
returns to the linked-from strategy. The Link button is, in
essence, a macro data-analysis procedure which is itself a
strategy, since this button opens up new strategies. For
example, clicking on the !! portion of the Link:Explore
button in Figure 1 causes the Explore Data guidemap,
shown in Figure 2, to appear. Correspondingly, clicking on
the Return button in Figure 2 (when it is highlighted) will
take you back to Figure 1. Upon return to the guidemap in
Figure 1, the high-lighting of the buttons will change
according to the connecting arrows. That is, the
Link:Explore button will de-activate, and the
Link:Transform and Link:Analyze buttons will activate.
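
One way to picture the Link and Return behaviour is as a stack of guidemaps. The following is a sketch under that assumption. The names `Guidemap`, `Navigator`, `link` and `go_back` are hypothetical, and the check that the Return button is highlighted is omitted for brevity.

```python
class Guidemap:
    """One strategy map: its arrows and its currently highlighted buttons."""

    def __init__(self, name, arrows, initial):
        self.name = name
        self.arrows = arrows          # button title -> titles it points to
        self.active = set(initial)    # highlighted buttons

    def finish(self, button):
        """Deactivate a completed button and activate the ones it points to."""
        self.active.discard(button)
        self.active.update(self.arrows.get(button, []))


class Navigator:
    """Tracks which guidemap is displayed and how the analyst got there."""

    def __init__(self, top):
        self.stack = [(top, None)]    # (guidemap, Link button that opened it)

    @property
    def current(self):
        return self.stack[-1][0]

    def link(self, button, target):
        """Follow an active Link button into the target guidemap."""
        if button not in self.current.active:
            raise ValueError(f"{button} is not active")
        self.stack.append((target, button))

    def go_back(self):
        """Return to the linked-from guidemap and advance its highlighting."""
        if len(self.stack) == 1:
            raise ValueError("already at the top-level guidemap")
        _, via = self.stack.pop()
        self.current.finish(via)


cycle = Guidemap("Analysis Cycle",
                 {"Link:Explore": ["Link:Transform", "Link:Analyze"]},
                 initial=["Link:Explore"])
explore = Guidemap("Explore Data", {}, initial=["List Variables"])

nav = Navigator(cycle)
nav.link("Link:Explore", explore)
inside = nav.current.name
nav.go_back()
```

After `go_back()`, the top-level map is displayed again with Link:Explore deactivated and Link:Transform and Link:Analyze activated, as the paragraph above describes.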

The  GoTo  button  changes  the  focus  of  the  data  analysis, 
and  of  the  strategy,  to  a  new  data  or  model  object.  When  a 
new  object  has  been  created  and  named,  then  the  name  of 
that  object  replaces  Data  or  Model  in  the  GoTo  button. 
Then,  when  the  GoTo  button  is  clicked,  the  appropriate 
data  or  model  icon  is  highlighted  in  the  workmap,  and  the 
appropriate  strategy  is  displayed  in  the  guidemap  window. 

All  buttons  other  than  flow  buttons  are  procedure  buttons 
that  activate  data-analysis  procedures.  In  Figure  2  we  see 
procedure  buttons  such  as  List  Variables  and  Visualize 
Data. When an active procedure button is clicked, the
indicated data-analysis procedure (listing variables, showing
the datasheet) is activated.

3.2  The  Focus  of  the  Strategy 

The  focus  of  a  statistical  strategy  is  a  data-analysis  object 
(a dataset or a model). In Figure 1, the icons named CarRatings
and Norm-CarRatings are data icons, whereas
PCA-CarPrefs is a model. The focus object is represented
by the icon that is highlighted in the workmap.

Each  time  a  new  object  is  created,  it  is  represented  by  a 
new  icon.  Whenever  a  new  dataset  or  model  object  is 
derived  from  an  existing  dataset  object,  an  arrow  is  drawn 
from  each  of  the  new  object’s  parents  (usually  only  one)  to 
the  new  object  to  show  the  creation  dependency.  These 
arrows  have  a  meaning  that  parallels,  but  is  somewhat  dif¬ 
ferent  from,  their  meaning  in  the  guidemap:  They  repre¬ 
sent  the  flow  of  data  into  or  out  of  a  data-analysis  object 
(dataset or model) or procedure as a result of a data analyst's
action. In the guidemap, on the other hand, arrows
represent  potential  actions  a  data  analyst  might  take. 

The  evolving  progress  of  the  data-analysis  session  is 
shown  in  the  workmap.  Certain  actions  taken  via  the 
guidemap  create  new  nodes  in  the  workmap.  A  new 
dataset  object  may  be  created  by  a  mathematical  procedure 
(such  as  normalization  or  principal  components  analysis) 
or  by  a  non-mathematical  operation  (such  as  removing 
variables  or  merging  datasets).  A  new  model  object  is 
always  created  by  a  mathematical  procedure.  A  procedure 
icon  appears  between  the  original  and  new  objects  when 
the creation involved mathematical operations; otherwise,
no  procedure  icon  appears.  If  a  procedure  icon  appears,  the 
creation  dependency  arrow  is  drawn  from  the  parent 
objects  through  the  procedure  to  the  new  object.  Naturally, 
a  new  object  may  be  brought  in  from  “outside”  of  the  sys¬ 
tem,  in  which  case  the  new  object  is  not  connected  to  a 
parent  (e.g.,  CarRatings  in  Figure  1). 

The  specific  object  which  is  the  focus  of  the  analysis  (and, 
therefore,  of  the  analytic  strategy)  is  highlighted  in  the 
workmap.  In  Figure  1,  Scores  &  Ratings  is  the  focus 
object.  Any  data  or  model  object  in  the  workmap  can  be 
selected  at  any  time  to  be  the  new  focus  object.  When  a 
new  focus  object  is  selected,  the  new  strategy  associated 
with  it  is  displayed  in  the  guidemap  window,  and  the  user 
enters  that  strategy. 

The  workmap  and  guidemap  graphs  differ  in  several 
respects. First, the structure of the guidemap graph does not
change: it remains as shown throughout the analysis,
although its highlighting changes. The workmap graph,
on  the  other  hand,  grows  as  new  data  and  model  objects 
are  created  and  as  new  analysis  procedures  are  used  (both 
structure and high-lighting change). Second, the guidemap
is  a  (potentially)  cyclic  graph,  whereas  the  workmap  is  an 
acyclic  hierarchical  tree  graph.  This  represents  Lubinsky 
and  Pregibon’s  (1988)  observation  that  actions  taken  dur¬ 
ing  data  analysis  are  not  hierarchical,  but  are  cyclical, 
although  the  resulting  analysis  is  hierarchical.  Third,  the 
guidemap  (as  represented  by  the  initial  guidemap  shown 
in  Figure  1,  and  all  its  sub-guidemaps)  has  an  entry  point 
but  no  exit  point,  whereas  workmaps  have  both  entries  and 
exits.  This  represents  the  fact  that  a  strategy  has  a  begin¬ 
ning  step  but  no  final  step.  The  lack  of  an  exit  point  from  a 
strategy  reflects  the  fact  that  a  strategy  is  cyclic,  and  that 
users  should  be  able  to  quit  a  strategy  (with  the  window’s 
close  box)  whenever  they  choose. 
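
The cyclic versus acyclic distinction can be checked mechanically. The sketch below uses a standard depth-first search for a back edge. The edge lists are our reading of the prose in Sections 4.1 and 4.2, not a transcription of the figures, and the `GoTo:Data` label is an assumed name for the "start over with new data" button.

```python
def has_cycle(graph):
    """Return True if the directed graph (node -> successors) has a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2      # unvisited, on current path, finished
    colour = {node: WHITE for node in graph}

    def visit(node):
        colour[node] = GREY
        for nxt in graph.get(node, []):
            state = colour.get(nxt, WHITE)
            if state == GREY:         # back edge: we returned to the current path
                return True
            if state == WHITE and visit(nxt):
                return True
        colour[node] = BLACK
        return False

    return any(colour[n] == WHITE and visit(n) for n in list(graph))


# Guidemap arrows (potential actions), as described for the Analysis Cycle.
guidemap = {
    "Link:Explore": ["Link:Transform", "Link:Analyze"],
    "Link:Transform": ["Link:Analyze"],
    "Link:Analyze": ["GoTo:Model"],
    "GoTo:Model": ["Link:Transform", "GoTo:Data"],
}

# Workmap arrows (creation dependencies), as described for Figure 1.
workmap = {
    "CarPrefs": ["PrnCmp"],
    "PrnCmp": ["PCA-CarPrefs"],
    "PCA-CarPrefs": ["Scores-PCA-CarPrefs"],
    "CarRatings": ["Norm"],
    "Norm": ["Norm-CarRatings"],
    "Norm-CarRatings": ["Scores & Ratings"],
    "Scores-PCA-CarPrefs": ["Scores & Ratings"],
}

guidemap_is_cyclic = has_cycle(guidemap)
workmap_is_cyclic = has_cycle(workmap)
```

Under these assumed edges, the guidemap contains the loop Transform, Analyze, GoTo:Model, Transform, while the workmap has no loop.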

3.3  The  Role  of  the  Expert  Statistician 

We  turn  now  from  the  first  two  aspects  of  our  definition  of 
strategy  (the  formal  representation  and  the  focus)  to  the 
third  aspect,  namely  that  a  statistical  strategy  represents 
the  conceptual  structure  of  an  expert  statistician. 

It is assumed that the expert is only expert in a circumscribed
domain  of  statistical  analysis,  not  for  the  entire  domain. 
The  role  of  such  an  expert  is  to  decide,  for  the  expert’s 
area  of  statistical  analysis  expertise,  what  steps  are 
involved,  and  in  what  order  the  steps  should  be  taken. 
Thus, the representation shown in the guidemaps in Figures
1 and 2 (and in other guidemaps that are not shown) is
based on an expert's knowledge about exploratory data analysis.
These  guidemaps  represent  the  expert’s  conceptual  struc¬ 
ture  of  the  sequence  of  steps  involved  in  exploratory  data 
analysis.  The  expert  creates  these  guidemaps  by  using  the 
“guidetools”  that  are  discussed  in  Section  5.0. 

3.4  The  Objects,  Procedures  and  Actions 

The  final  aspect  of  our  definition  of  strategy  is  that  the 
expert’s  conceptual  data  analysis  structure  concerns  three 
classes  of  things  and  the  relationships  among  these  things. 
The  things  are  the  data-analysis  objects,  the  data-analysis 
procedures,  and  the  data  analyst's  actions.  All  three  are 
included  in  our  representation  of  statistical  strategy. 

Data-analysis  Objects:  There  are  two  types  of  data-anal¬ 
ysis  objects:  dataset  objects  and  model  objects.  Both  types 
of  data-analysis  objects  are  represented  by  icons  in  the 
workmap (but not in the guidemap). Datasets are represented
by tall rectangular icons containing very narrow
vertical bars (representing variables). Models, like data,
are represented by tall rectangular icons, but they contain
mathematical symbols as well as “variable” bars to reflect
the fact that models are data that have been subjected to
mathematical operations. The highlighted data-analysis
object  is  the  focus  of  the  statistical  strategy. 

Data-analysis  Procedures:  Procedures  are  represented  by 
the  wide  rectangular  icons  in  the  workmap  and  guidemap 
windows.  The  procedures  are  the  nodes  of  the  guidemap’s 
strategy  structure,  with  each  node  being  a  self-contained 
piece  of  statistical  computation,  including  visualizations 
(construction  and  presentation  of  dynamic  statistical 
graphics),  tables  and  textual  results.  These  procedure- 
nodes,  in  the  exploratory  data-analysis  example  shown  in 
Figure  2,  include  the  show  datasheet,  list  variables  and  list 
observations  nodes,  the  data  visualization,  reporting,  and 
summary  nodes,  and  the  node  to  create  new  data.  These 
are  the  kinds  of  exploration  procedures  the  expert  deems 
to  be  appropriate  parts  of  the  analysis  strategy. 

Data  Analyst’s  Actions:  The  possible  actions  of  the  data 
analyst  are  represented  in  the  guidemap  by  the  arrows  con¬ 
necting  the  procedure  icons.  On  the  other  hand,  in  the 
workmap  the  arrows  indicate  actions  that  the  data  analyst 
has  already  taken.  In  the  guidemap  window,  the  direction 
of  the  arrow  indicates  the  order  in  which  the  expert  thinks 
the  novice  should  use  the  data-analysis  procedures.  Thus, 
the  data  exploration  strategy  in  Figure  2  indicates  that  the 
expert  thinks  the  first  three  steps  should  be  looking  at  the 
data  themselves  or  listing  their  variable  names  or  observa¬ 
tion  labels.  Note  that  these  procedure-buttons  are  high¬ 
lighted  and  others  are  not.  Once  all  three  of  these  actions 
are  taken,  the  next  three  buttons  become  highlighted  (and 
the  first  three  become  gray),  indicating  that  the  next  three 
analysis  procedures  are  now  available.  In  this  way,  the 
novice  is  guided  through  the  data  exploration  strategy.  At 
least  one  of  the  procedure-buttons  in  the  guidemap  win¬ 
dow  is  always  active,  indicating  which  of  the  procedures 
can  be  used  next  by  the  analyst.  Initially,  when  a  strategy 
is  entered,  certain  procedure(s)  are  highlighted,  indicating 
what  the  analyst  should  do,  and  that  the  system  is  waiting 
for  an  action. 


4.0  Using  Statistical  Strategies 


In  the  previous  section  we  described  how  we  represent  our 
concept  of  statistical  strategy,  a  representation  involving 
two  graphs,  called  the  guidemap  and  workmap.  In  this  sec¬ 
tion  we  describe  how  the  data  analyst  uses  these  two 
graphs. 


4.1  Using  the  Guidemap 

The  guidemap  window  presents  a  map  of  an  expert’s  sta¬ 
tistical  strategy.  This  map  is  used  to  guide  data  analyses 
performed  by  novice  analysts.  At  the  very  beginning  of  the 
analysis  of  a  new  dataset  object  (see  Figure  1),  the 
guidemap window contains the Analysis Cycle
guidemap.  This  guidemap  presents  the  overall  flow  of  a 
data  analysis,  emphasizing  the  major  steps  and  their  cycli¬ 
cal  relationship.  The  initial  highlighting  of  this  map  guides 
the user to explore the data, since the Link:Explore button
is  the  only  active  (highlighted)  button. 

The  flow  of  guidance  is  indicated  by  the  arrows  connect¬ 
ing  buttons:  When  an  active  button’s  action  is  completed, 
the  button  deactivates  (changes  to  gray),  and  the  buttons 
that  are  pointed  to  by  its  arrows  are  activated.  The  change 
in  high-lighting  indicates  the  actions  that  the  user  is 
guided  to  take  next,  and  the  arrows  indicate  how  guidance 
flows.  Therefore,  in  Figure  1,  after  the  data  are  explored 
the  analyst  is  guided  to  transformation  or  analysis. 

Let us consider how the guidemap in Figure 1 works. First
of  all,  note  that  all  of  the  buttons  in  the  guidemap  are 
macro  buttons:  Whenever  one  of  them  is  used  a  new  strat¬ 
egy  map  will  replace  the  one  shown  in  the  figure.  When 
the  new  strategy  map  is  completed,  the  user  will  once 
again be shown the map in the figure, although its pattern
of  high-lighting  will  have  changed  as  indicated  by  the 
arrows.  Thus,  after  exploring  the  data,  the  transformation 
and  analysis  (i.e.,  model  fitting)  buttons  become  high¬ 
lighted.  If  transformation  is  chosen  first,  then  when  this  is 
completed  the  analyst  will  be  guided  to  analyze  the  data. 
If,  instead,  analysis  is  chosen  before  transformation,  then 
when the analysis is complete the GoTo:Model button will
become highlighted. Note, however, that if the Transform
button was not used before the analysis, it will
remain  highlighted,  so  that  the  user  now  has  the  choice  of 
either  transforming  the  data  and  then  re-analyzing,  or  of 
proceeding  to  look  at  the  model.  Finally,  after  looking  at 
the  model,  the  user  can  either  transform  the  data  once 
again,  or  start  over  with  a  new  set  of  data.  Thus,  this  map 
represents  the  expert’s  view  that  data  analysis  is  a  cycle 
that  begins  with  exploration  and  which  may  or  may  not 
involve  transformation  before  the  first  data  analysis 
(model  fitting).  Then,  the  model  resulting  from  the  analy¬ 
sis  should  be  looked  at.  The  model  may  or  may  not  sug¬ 
gest  re-transformation,  with  this  cycle  of  transformation, 
analysis  and  model  inspection  continuing  indefinitely. 
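
The walk-through above can be replayed as a small simulation. This is a sketch only: the arrow table is read off the prose rather than the figure, `GoTo:Data` is an assumed label for the "start over with a new set of data" button, and `use` is a hypothetical helper.

```python
# Arrows of the Analysis Cycle guidemap, as described in the text.
ARROWS = {
    "Link:Explore": ["Link:Transform", "Link:Analyze"],
    "Link:Transform": ["Link:Analyze"],
    "Link:Analyze": ["GoTo:Model"],
    "GoTo:Model": ["Link:Transform", "GoTo:Data"],
}


def use(active, button):
    """Return the new set of highlighted buttons after using `button`."""
    if button not in active:
        raise ValueError(f"{button} is not highlighted")
    # The used button turns gray; the buttons it points to light up.
    # Buttons that were already highlighted and not used stay highlighted.
    return (active - {button}) | set(ARROWS.get(button, []))


start = {"Link:Explore"}
after_explore = use(start, "Link:Explore")

# Path 1: analyze without transforming first.
after_analyze = use(after_explore, "Link:Analyze")

# Path 2: transform first, then the analyst is guided to analyze.
after_transform = use(after_explore, "Link:Transform")
```

On path 1 the Transform button remains highlighted next to GoTo:Model, which reproduces the choice between re-transforming and inspecting the model.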

Note  that  when  a  new  dataset  object  is  created  (for  exam¬ 
ple,  by  transformation)  the  user  will  always  be  given  the 
choice to change the focus of the strategy to the new data,
thus beginning the analysis cycle all over again with a
brand new, unused Analysis Cycle guidemap, starting
with  data  exploration.  On  the  other  hand,  the  analyst  may 
also  continue  focusing  on  the  old  data,  if  desired,  although 
usually  when  new  data  are  created  the  user  will  shift  focus 
to them. Thus, there is an implicit cycle in the data-analysis
process that does not appear in the guidemap: Whenever
new data are created the analysis cycle usually
recommences.

Let  us  now  turn  to  consider  what  happens  when  the 
Link:Explore  button  is  used.  Since  this  button  is  a  macro 
button (i.e., a button which corresponds to another guidemap),
when it is used the map in the window changes to
the Explore Data guidemap shown in Figure 2. Now, as
indicated  by  the  button  highlighting,  the  analyst  has  the 
choice  of  three  actions:  show  the  datasheet,  list  variable 
names  or  list  observation  labels.  When  the  user  chooses 
any  one  of  these  three  actions,  the  action  takes  place  and 
the  chosen  button  turns  gray,  since  it  is  no  longer  a  recom¬ 
mended  action.  The  other  two  buttons  remain  highlighted. 

Notice that the just-used button is connected to a short vertical
arrow rather than to another button. This short vertical
arrow is called an and icon because it is an “and” gate
that restricts the flow of guidance from one action to the
next.  Specifically,  all  of  the  buttons  that  are  connected  to 
an  and  icon  must  be  used  before  guidance  can  flow 
through  the  icon  to  the  buttons  that  follow  it. 

Thus,  when  one  of  the  active  buttons  in  Figure  2  is  used, 
no  other  buttons  become  highlighted  until  all  three  active 
buttons  are  used.  Then,  all  of  the  buttons  that  have  arrows 
pointing  to  them  from  the  and  icon  are  activated.  In  this 
way  the  user  is  guided  to  use  all  three  active  buttons  in 
Figure  2  before  doing  anything  else.  They  can  be  used  in 
any  order.  Once  they  are  all  used,  the  next  group  of  three 
buttons  is  activated,  and  the  analyst  must  use  them  (in  any 
order)  before  going  on.  After  these  three  buttons  have  been 
used,  the  map  appears  as  shown  in  Figure  3. 
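
The and icon behaves like a join that fires once all of its inputs have been used. A minimal sketch follows. The class `AndGate` is hypothetical, and apart from List Variables and Visualize Data the button titles are assumed names for the nodes the text describes.

```python
class AndGate:
    """Passes guidance on only after every incoming button has been used."""

    def __init__(self, inputs, outputs):
        self.pending = set(inputs)    # buttons that still have to be used
        self.outputs = list(outputs)  # buttons to activate once the gate opens
        self.fired = False

    def mark_used(self, button):
        """Record a used button; return the buttons to activate (maybe none)."""
        self.pending.discard(button)
        if self.pending or self.fired:
            return []
        self.fired = True
        return list(self.outputs)


gate = AndGate(
    inputs=["Show Datasheet", "List Variables", "List Observations"],
    outputs=["Visualize Data", "Summarize Data", "Report Data"],
)

# The three inputs can be used in any order; nothing new lights up
# until the last of them has been used.
first = gate.mark_used("List Variables")
second = gate.mark_used("Show Datasheet")
third = gate.mark_used("List Observations")
```

The order of the three calls does not matter, which mirrors the statement that the buttons feeding an and icon can be used in any order.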

The  guidemap  in  Figure  3  has  changed  from  the  one  in 
Figure  2:  The  data  analyst  is  now  being  guided  to  either 
return  to  the  guidemap  which  led  to  this  one  (the  one 
shown in Figure 1, but with the Transform and Analyze
buttons  activated)  or  to  create  a  new  dataset  object.  The 
analyst  may  wish  to  take  the  latter  step  to  create  a  subset 
of  the  original  data.  If  the  decision  is  made  to  create  new 
data,  then  the  analyst  has  the  choice  of  going  to  those  data, 
which brings up a brand new Analysis Cycle map (identical
to that shown in Figure 1) or of returning to the old
Analysis Cycle map (with the structure shown in Figure
1, but with Transform and Analyze activated).

Note  how  the  strategy  has  guided  the  analyst:  As  shown  in 
Figure  1,  the  analyst  must  explore  the  data  first.  The  ana¬ 
lyst  must  analyze  the  data  before  inspecting  the  model.  In 
Figures 2 and 3 the analyst must look at the data and their
identifying  information  before  visualizing  the  data  or  get¬ 
ting  summary  statistics.  On  the  other  hand,  the  data  ana¬ 
lyst  has  choices:  In  Figure  1,  it  is  not  required,  though  it  is 
possible,  to  transform  the  data  before  fitting  the  model. 
Similarly,  in  Figure  2,  it  is  possible  to  visualize  the  data 
before  seeing  summary  statistics,  or  to  do  the  actions  in 
the  reverse  order. 

4.2  Using  the  WorkMap 

Figure 3: Strategy for Exploring Data after using several analysis procedures.

In the example shown in Figure 1, the workmap shows a
data-analysis session that has already involved several
major steps. In the first step, the analyst read in the data
that defined the CarPrefs dataset object. These data were
then submitted to a Principal Components Analysis, as
indicated by the PrnCmp procedure icon. This analysis
produced the PCA-CarPrefs model object. The analyst
then requested that a new dataset object Scores-PCA-CarPrefs
be created by the model object. Separately, the
analyst also read in data that defined the CarRatings
dataset object. These data were normalized, as indicated
by the Norm procedure icon, creating a new dataset object
named Norm-CarRatings, which was merged with the
Scores-PCA-CarPrefs dataset object to obtain another
dataset object named Scores & Ratings (the current
focus of the statistical strategy).
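
The session just described can be written down as a table of creation dependencies. The sketch below assumes, for illustration, that each object records its parents and that the merge and the score extraction add no procedure icon. The function `lineage` is a hypothetical helper.

```python
# Each workmap node mapped to the nodes it was created from.
parents = {
    "CarPrefs": [],                       # read in from outside the system
    "PrnCmp": ["CarPrefs"],               # procedure icon
    "PCA-CarPrefs": ["PrnCmp"],           # model object
    "Scores-PCA-CarPrefs": ["PCA-CarPrefs"],
    "CarRatings": [],                     # read in from outside the system
    "Norm": ["CarRatings"],               # procedure icon
    "Norm-CarRatings": ["Norm"],
    "Scores & Ratings": ["Norm-CarRatings", "Scores-PCA-CarPrefs"],
}


def lineage(node):
    """Return every ancestor of `node`, with parents listed before children."""
    seen = []

    def walk(current):
        for parent in parents[current]:
            walk(parent)
            if parent not in seen:
                seen.append(parent)

    walk(node)
    return seen


roots = [name for name, deps in parents.items() if not deps]
model_history = lineage("PCA-CarPrefs")
focus_history = lineage("Scores & Ratings")
```

Because every step is kept, the history of the focus object can be recovered at any time, which is what makes backtracking from a workmap icon possible.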

It  should  be  emphasized  that  portions  (or  all)  of  the  data 
analysis  can  be  created  directly  in  the  workmap  window, 
without using the guidemap window, whenever a sufficiently
sophisticated data analyst wishes. An entire data
analysis can be created from the workmap without ever
seeing a guidemap. This can be done by clicking the mouse
on  the  body  of  an  icon  to  obtain  a  pop-up  menu  of  actions 
that  the  icon  supports.  These  menu-items  are  also  accessi¬ 
ble  from  the  Data  and  Transform  menus  of  the  menu- 
bar  shown  at  the  top  of  Figure  1.  The  pop-up  menu  for 
model  icons  corresponds  to  the  Model  menu  in  the  menu- 
bar.  The  analysis  procedures  are  accessed  from  the 
menubar’s Analyze menu (and from an optional workmap
toolbar that is now shown).

It  should  also  be  emphasized  that  a  previous  portion  of  the 
data  analysis  can  be  revisited  at  any  time  by  simply  click¬ 
ing  on  the  appropriate  workmap  icon.  Then,  the  analysis 
can  be  continued  in  a  new  direction  by  simply  taking  dif¬ 
ferent  steps  than  were  taken  previously.  Thus,  the  work- 
map  graph  provides  a  very  convenient  and  simple  way  of 
backtracking,  a  feature  that  can  be  very  hard  with  conven¬ 
tional  systems  which  do  not  keep  a  full  history  of  a  data 
analysis  session.  This  can  be  done  across  sessions  by  sav¬ 
ing  (portions  of)  the  workmap  and  reloading  it  in  a  later 
session  (only  partially  implemented  at  this  time). 

Also, note that if the data analyst is performing the analysis
directly from the workmap, guidance is available at
any time by simply requesting that the guidemap be shown.
When  so  requested,  the  appropriate  portion  of  the 
guidemap  structure  is  displayed  in  the  guidemap  window. 
Thus,  it  is  possible  for  the  data  analyst  to  use  guidance 
when  needed,  and  to  avoid  it  when  it  is  not  needed. 

5.0 Creating Statistical Strategies

The  guidemaps  that  embody  statistical  strategy  are  created 
while  in  “authoring”  mode.  In  this  mode  there  is  an 
Author WorkBench  window  in  which  new  guidemaps 
are  created.  In  addition,  a  Tools  menu  is  added  to  the 
menubar,  and  the  action  of  all  Data,  Transform,  Analyze 
and  Model  menu  items  is  enhanced. 


Taken  together,  the  modified  menu  items  and  the  new 
Tools  menu  items  are  “guidetools”  that  are  used  to  create 
new guidemaps. The expert uses these guidetools to create
the  buttons  that  are  to  become  the  nodes  of  the  guidemap. 
Recall  that  there  are  flow  buttons,  which  control  the  flow 
between  portions  of  the  analysis,  and  procedure  buttons, 
which  control  the  use  of  data-analysis  procedures.  The 
Tools  menu  creates  flow  buttons,  while  the  other  menus 
create  procedure  buttons. 

Procedure  buttons  are  created  by  using  those  menu  items 
that  are  needed  to  perform  the  specific  type  of  data  analy¬ 
sis  for  which  guidance  is  being  created.  When  in  authoring 
mode,  the  action  of  the  menu  items  is  modified  so  that,  in 
addition  to  the  analysis  action  taking  place,  a  button  is 
placed  on  the  author’s  workbench  (the  button’s  title  is  the 
same  as  the  menu  item’s  name). 
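
The idea that a menu item keeps its normal action and, in authoring mode, additionally places a button on the workbench can be sketched with a wrapper. Everything here is an assumption made for illustration: the `Workbench` class, the `menu_item` decorator, and the toy dataset are not part of the actual system.

```python
import functools


class Workbench:
    """Collects buttons while the system is in authoring mode."""

    def __init__(self):
        self.authoring = False
        self.buttons = []             # titles of buttons placed so far

    def menu_item(self, title):
        """Wrap a menu action so that authoring mode also records a button."""
        def decorate(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                result = func(*args, **kwargs)   # the analysis always runs
                if self.authoring:
                    self.buttons.append(title)   # button titled like the item
                return result
            return wrapper
        return decorate


bench = Workbench()


@bench.menu_item("List Variables")
def list_variables(dataset):
    """Return the variable names of a dataset held as a dict of columns."""
    return sorted(dataset)


data = {"mpg": [21.0, 22.8], "weight": [2.62, 2.32]}   # made-up values

normal_result = list_variables(data)      # normal mode: no button is recorded
bench.authoring = True
authoring_result = list_variables(data)   # authoring mode: same result, plus a button
```

The wrapped function returns the same result in both modes, so the expert's analysis is not changed by authoring.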

Note  the  basic  design  philosophy  underlying  the  creation 
of  statistical  strategies:  The  expert  creates  the  guidemap’s 
data-analysis  procedure  buttons  by  using  the  menu  system 
in  exactly  the  same  way  that  s/he  would  use  it  when  it  is 
not  in  authoring  mode.  Since  the  system  is  in  authoring 
mode,  buttons  appear  in  the  workbench  window.  Other¬ 
wise,  everything  is  the  same  as  when  the  system  is  not  in 
authoring  mode.  This  design  feature  means  that  the  expert 
is  free  to  perform  whatever  analysis  is  desired,  using  what¬ 
ever  data-procedures  are  appropriate,  without  any  new 
authoring  “features”  changing  the  process. 

On  the  other  hand,  flow  buttons,  which  do  not  correspond 
to  data-analysis  actions,  are  created  by  using  the  new 
authoring  “features”  that  are  represented  by  items  of  the 
Tools  menu.  There  is  a  menu  item  for  each  type  of  flow 
button,  including  Link,  GoTo,  Return  and  And  items  (for 
icons  shown  in  previous  figures),  AutoLink  and  AutoRe- 
turn  items  that  cause  a  guidemap  to  automatically  link  to 
another  guidemap  and  to  automatically  return  to  the 
linked-from  guidemap,  and  an  Initial  item  to  indicate 
which  buttons  are  to  be  activated  when  the  guidemap  is 
initially  displayed.  Thus,  while  the  author  does  not  need  to 
learn  any  new  aspects  of  the  system  while  creating  the  pro¬ 
cedure  steps  of  the  data  analysis,  new  features  must  be 
learned  to  indicate  flow  control  (the  actual  guidance).  In  a 
more  complete  implementation,  many  additional  flow- 
control  features  would  be  available. 

Once  the  expert  has  placed  two  or  more  buttons  or  icons 
on  the  workbench,  s/he  can  connect  them  together  with  an 
arrow  drawing  tool.  Of  course,  at  any  time  the  buttons  and 
icons  can  be  dragged  to  new  locations  to  give  the 
guidemap  a  more  pleasing  and  comprehensible  layout.  The 
arrows automatically reposition themselves to reflect the
new layout. When the map is complete,
the expert saves it for later use by the novice.

Finally,  the  expert  must  create  the  help  information  that  is 
displayed  when  the  novice  clicks  on  the  ??  side  of  a  but¬ 
ton.  This  is  done  by  using  an  ordinary  text  editor,  and  by 
saving  files  with  certain  naming  conventions  so  that  they 
can  be  found  when  needed. 


6.0 Discussion

In  this  section  we  discuss  the  relation  of  our  work  to 
hypertext  and  to  visual  programming,  two  concepts  with 
their  origins  in  computer  science. 

6.1  Hypertext  and  Hypercode 

Hypertext  (or,  more  generally,  hypermedia)  is  a  generic 
approach  to  linking  and  structuring  all  forms  of  computer¬ 
ized  materials  so  that  non-linear,  dynamic  documents  can 
be  constructed  (for  more  information,  consult  Woodhead 
(1990)  or  Martin  (1990)).  Hypermedia  consist  of  nodes 
that are connected by links. The nodes contain the materials,
which may be text, diagrams, animations, images,
video,  sound,  computer  programs  or  any  other  computer¬ 
ized  information.  The  links  provide  a  mechanism  for  non¬ 
linear  navigation  among  the  nodes.  The  nodes  may  be 
linked  together  into  web,  hierarchical,  cyclic,  or  other 
structures.  Hypermedia  always  have  tools  for  navigating 
the  link  structure  and  for  displaying  the  node  material. 

Clearly,  our  help  system  is  a  hypertext:  Guidemap  buttons 
are  nodes  that  contain  help  text,  and  arrows  are  links 
between  nodes.  In  addition,  the  ??  side  of  a  guidemap  but¬ 
ton  is  the  tool  that  accesses  and  displays  the  hypertext.  The 
buttons  also  navigate  the  hypertext.  Finally,  the  structure 
of  the  hypertext  is  shown  by  the  structure  of  the  guidemap. 

Of  much  more  interest  is  the  fact  that  our  guidance  system 
is  a  “hypercode”,  a  form  of  hypermedia  where  the  materi¬ 
als  are  computer  programs.  Note  that  the  structure  of  the 
hypercode  is  represented  by  the  structure  of  the  guidemap, 
and  that  the  hypercode  is  navigated  by  clicking  on  the  !! 
side  of  guidemap  buttons.  When  the  naive  analyst  clicks 
on the !! side of a button, the button not only navigates to a
particular  piece  of  hypercode,  but  also  causes  the  execu¬ 
tion  of  that  piece  of  code.  Thus,  from  the  point-of-view  of 
the  naive  user,  the  guidemaps  display  the  structure  of  the 
guidance  hypercode,  provide  a  means  of  navigating 
through it, and a means of executing pieces of it. (Note
that the guidemaps also display the structure of the help
hypertext, provide a means of navigating through it, and
a means of displaying pieces of it. Thus, both the hypertext and
hypercode are seamlessly unified.)

It  follows  that  the  expert  user’s  process  of  authoring 
guidemaps  is,  in  fact,  a  process  for  writing  hypercode.  As 
described above, authoring involves creating two kinds of
buttons: procedure buttons and flow buttons. When a procedure
button is created, the code that is written is a ViSta function
which parallels a data-analysis menu item and which
causes  a  data-analysis  step  to  take  place.  On  the  other 
hand,  when  the  author  creates  a  flow  button,  the  code  that 
is  written  consists  of  standard  Lisp  flow  control  functions. 

Thus,  authoring  guidemaps  is  computer  programming. 
However,  it  is  not  the  usual  type  of  programming  in  which 
the  programmer  types  statements.  Rather,  it  is  one  in 
which  the  statements  get  generated  automatically  when  the 
author  (programmer)  selects  a  button.  This  form  of  com¬ 
puter  programming  is  known  as  visual  programming, 
which  is  discussed  in  the  next  section. 

6.2  Visual  Programming  &  Program 
Visualization 

Visual  programming  and  program  visualization  are  very 
active areas of research in computer science. Their goal is
to simplify programming, and to make programming
accessible to a wider audience. They attempt to reach this
goal  by  combining  the  disciplines  of  interactive  graphics, 
computer  languages  and  software  engineering  to  take 
advantage  of  a  person’s  non-verbal  visual  capabilities  and 
a  computer’s  interactive  graphical  capabilities. 

Conventional  textual  computer  languages  process  program 
instructions that exist in one-dimensional, nongraphical
(textual) streams. Visual programming, by contrast, refers
to a way for people to create programs using graphical
methods, such as arranging icons. These icons can be viewed as two-dimensional
graphical  instructions  (Myers,  1990),  as  opposed  to  one¬ 
dimensional  textual  instructions  (although  the  two-dimen¬ 
sional  visual  program  is  translated  into  an  underlying  one¬ 
dimensional  textual  program). 

Program  visualization,  on  the  other  hand,  is  an  entirely 
different  concept:  Here,  the  program  is  specified  in  the 
usual  textual  manner,  but  is  then  illustrated  visually  in 
some  form.  Thus,  the  program  is  specified  as  text  and 
translated  into  graphics.  Note  that  this  reverses  the  process 
involved  in  visual  programming,  where  the  program  is 
specified as graphics and is translated into text.

Guidemaps and workmaps are simple examples of visual
programming  and  program  visualization.  Guidemaps  are 
visual  programs  which  have  been  created  by  an  expert 
using  a  visual  authoring  system,  and  which  are  “executed” 
by  the  novice.  Workmaps  are  program  visualizations 
which  have  been  created  textually  (or  visually).  In  fact, 
when  a  workmap  is  saved  and  re-executed,  it  becomes  a 
visual  program  as  well  as  a  program  visualization. 

The  earliest  visual  languages  were  computerized  flow¬ 
charts.  More  recently,  visual  languages  are  formally  based 
on  graph  theory,  consisting  of  nodes  and  edges  (note  the 
connection  with  hypertext).  Often  the  edges  are  directed 
(and  called  arrows).  There  are  graphs  such  as  “higraphs”, 
which  allow  nodes  to  contain  other  nodes  and  which  per¬ 
mit  arrows  to  split  and  join,  or  “colored  petri  nets”  which 
allow  parallel  processing  systems  to  be  constructed.  A 
number  of  visual  programming  systems  use  dataflow  dia¬ 
grams.  Here  the  operations  are  typically  put  in  nodes,  and 
the  data  flow  along  the  arrows  connecting  the  nodes. 
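
A dataflow diagram of this kind can be evaluated by running each node once all of its inputs are available. The sketch below uses the standard-library `graphlib.TopologicalSorter` (Python 3.9 or later). The node names, the operations, and the `run` helper are invented for illustration and do not correspond to any of the systems cited.

```python
from graphlib import TopologicalSorter

# Each node holds an operation and the names of the nodes that feed it.
nodes = {
    "read":   (lambda: [2.0, 4.0, 6.0], []),
    "mean":   (lambda xs: sum(xs) / len(xs), ["read"]),
    "centre": (lambda xs, m: [x - m for x in xs], ["read", "mean"]),
}


def run(diagram):
    """Evaluate every node after its inputs; return all computed values."""
    dependencies = {name: deps for name, (_, deps) in diagram.items()}
    values = {}
    # static_order() yields each node only after all of its predecessors.
    for name in TopologicalSorter(dependencies).static_order():
        operation, deps = diagram[name]
        values[name] = operation(*(values[d] for d in deps))
    return values


results = run(nodes)
```

The topological order exists only for acyclic diagrams, which is one reason a dataflow representation suits the workmap rather than the (potentially) cyclic guidemap.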

We  have  based  guidemaps  on  directed  cyclic  graphs  and 
workmaps  on  directed  acyclic  dataflow  diagrams  (Young 
&  Smith,  1991).  Our  developments  are  limited,  however, 
in  that  we  have  not  developed  looping  or  conditional 
branching.  Thus,  one  can  argue  that  our  workmaps  and 
guidemaps  do  not  constitute  a  full  visual  programming 
language,  since  the  abstract  definition  of  a  computer  lan¬ 
guage  requires  the  inclusion  of  these  capabilities. 

We  recommend  investigating  the  feasibility  of  developing 
(or  using  an  existing)  visual  dataflow  language  as  the  basis 
for  a  structured  graphical  interface  for  performing  and 
guiding  data  analysis.  Two  interesting  existing  systems  are 
VisaVis  (Poswig,  Vrankar  &  Morara,  1994)  and  Khoros 
(Rasure  &  Williams,  1991).  Both  are  functional  visual 
programming  languages  with  looping  and  conditional 
branching.  Khoros  is  also  a  dataflow  language. 

7.0  Conclusion 


Understanding  and  representing  statistical  strategy  is  a  rel¬ 
atively  new  area  of  research  that  is  just  now  gaining 
momentum.  Within  this  area  of  research,  it  appears  that 
our  visual  approach  to  statistical  strategy  is  new  and 
unique,  and  is  firmly  based  on  current  computer  science 
thinking.  As  the  capability  of  computers  continues  to 
increase,  while  their  price  continues  to  decrease,  the  audi¬ 
ence  for  complex  software  systems  such  as  data-analysis 
systems  will  become  wider  and  more  naive.  Thus,  it  is 
imperative  that  these  systems  be  designed  to  guide  data 


analysts  who  need  the  guidance,  while  at  the  same  time  be 
able  to  provide  full  data-analysis  power.  An  efficacious 
way  of  doing  this  is  certainly  needed,  and  we  believe  that 
our  visualized  statistical  strategies  have  the  potential  for 
great  payoff  in  the  improvement  of  the  quality,  satisfaction 
and  productivity  of  statistical  data  analysis. 

Naturally,  we  hope  that  our  visual  methods  for  guiding 
naive  data  analysts  by  visually  representing,  using  and  cre¬ 
ating  statistical  strategies  will  prove  useful.  Of  much 
greater  importance,  however,  is  our  basic  point:  Concen¬ 
trated  attention  should  be  given  by  computational  statisti¬ 
cians  to  the  representation,  usage  and  creation  of  statistical 
strategies.  We  believe  that  such  strategies  should  be  avail¬ 
able  to  guide  and  structure  the  data-analysis  process  so 
that  relatively  naive  users  can  perform  high-quality  data 
analyses.  And  we  believe  that  guidance  systems  should  be 
empirically  tested  to  see  if  they  deliver  on  their  promise. 
Abstract


This  paper  demonstrates  the  ability  of  ViSta-MDS  (Young, 
1994;  McFarlane,  1992)  to  facilitate  guided  data  analysis 
in  the  framework  of  multidimensional  scaling.  The  paper 
begins  with  a  brief  description  of  the  goals  of  a  multidi¬ 
mensional  scaling  analysis.  Next,  the  guidance  properties 
of  the  ViSta-MDS  module  are  described.  The  paper  illus¬ 
trates  each  step  in  a  guided  data  analysis,  and  shows  many 
of  the  GuideMaps  encountered  along  the  way. 

1.0 Multidimensional Scaling


The  data  for  a  multidimensional  scaling  analysis  are  in  the 
form  of  dissimilarity  matrices,  e.g.,  a  group  of  judges  rates 
the  dissimilarity  between  a  number  of  stimuli.  Each  stim¬ 
ulus  in  the  set  is  compared  with  every  other  stimulus  to 
form  a  matrix  of  dissimilarity  judgments.  Each  judge  con¬ 
tributes  one  matrix  of  dissimilarities  to  the  data  set. 

The  goal  of  a  multidimensional  scaling  (MDS)  analysis  is 
to  produce  a  low-dimensional  solution  space  such  that  the 
Euclidean  distances  between  the  points  in  the  solution 
space  most  closely  approximate  the  dissimilarity  judg¬ 
ments  provided  by  the  judges.  A  popular  measure  of  fit  in 
multidimensional  scaling  analyses  is  stress,  defined  as  the 
square  root  of  the  sum  of  the  squared  differences  between 
the  dissimilarity  judgments  provided  by  the  judges  and  the 
Euclidean  distances  in  the  MDS  solution  space. 
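
As a rough illustration of the stress measure just defined, the following sketch computes it for a small configuration. The function names, the three-point configuration and the judged dissimilarities are all made up for the example; note also that many MDS programs report a normalized variant (for example Kruskal's stress), which this sketch does not attempt.

```python
import math

def euclidean_distances(points):
    """Matrix of Euclidean distances between the rows of `points`."""
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]

def raw_stress(dissimilarities, points):
    """Square root of the summed squared differences between the judged
    dissimilarities and the solution-space distances, over pairs i < j."""
    d = euclidean_distances(points)
    n = len(points)
    total = sum((dissimilarities[i][j] - d[i][j]) ** 2
                for i in range(n) for j in range(i + 1, n))
    return math.sqrt(total)

# Hypothetical example: three stimuli forming a 3-4-5 right triangle.
points = [(0.0, 0.0), (3.0, 0.0), (3.0, 4.0)]
judged = [[0, 3, 5],
          [3, 0, 4],
          [5, 4, 0]]
print(raw_stress(judged, points))  # 0.0, since the distances match the judgments
```

With several judges, one simple option (an assumption here, not something the text specifies) would be to sum the squared differences over all matrices before taking the square root.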

2.0 Guided Statistical Analysis

As  is  true  with  many  statistical  models,  users  of  multidi¬ 
mensional  scaling  are  often  not  familiar  with  the  tech¬ 


niques  and  assumptions  associated  with  such  an  analysis. 
In  order  to  accommodate  the  wide  variety  of  users,  we 
have  developed  a  guided  statistical  analysis  system  in 
which  expert  users  may  provide  guidance  for  less  experi¬ 
enced  or  novice  analysts.  In  this  system,  expert  users  cre¬ 
ate  a  statistical  strategy  for  novice  users  to  follow.  The 
strategy  may  be  represented  either  graphically  or  in  a  text 
file;  for  the  purposes  of  this  exposition,  we  will  focus  on 
the graphical representation of the expert’s statistical strategy,
called the GuideMap.


A user begins a guided data analysis session by selecting
“Show GuideMap” from either the “Command” menu or
the WorkMap; the first GuideMap, shown in Figure 1, appears.

FIGURE 1. The initial ViSta GuideMap prompts the user
to load data.

As described in more detail by Young & Lubinsky
(1994), a GuideMap consists of buttons which can be
used to carry out steps in the analysis. The initial
guidemap is very simple: it only prompts the user to
load data. If the user clicks on the left half of the “Load
Data” icon (on the ??), a help screen appears and the user
is given information about loading data in ViSta. If
the user clicks on the right half of the icon (on the !!), loading
of a data file is initiated. Every guidemap button can provide
help about an analysis step and can cause the step to
be taken.


Copyright  ©  1994  by  Mary  M.  McFarlane.  All  rights  reserved.  For  further  information 
When the user selects a dissimilarity data set to be loaded,
the GuideMap is updated, becoming the one shown in Figure 2.
As is the case for most guidemaps, this one has several
buttons. Some of these buttons are “active” (the dark,
highlighted ones), whereas some are “inactive” (the gray
ones). Active buttons (such as the “Link: Explore” button
in Figure 2) are ready to cause an action. Once the button’s
action has taken place, the highlighting (activation) of the
buttons changes: the clicked button deactivates, and the
button(s) to which it points are activated. Inactive buttons
(such as the “Multidimensional Scaling” button in Figure
2) are not available for any action: clicking on them has
no effect.
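
The activation rule described above can be sketched as a small directed graph of buttons. This is only an illustration of the rule, not ViSta's implementation: the class name, the method name and the particular arrows between buttons are assumptions, and only the button labels come from the text.

```python
class GuideMap:
    """Buttons are nodes; arrows point from a step to the step(s) it enables."""

    def __init__(self, arrows, initially_active):
        self.arrows = arrows                  # button -> list of successor buttons
        self.active = set(initially_active)   # currently highlighted buttons

    def click(self, button):
        """Carry out a step; return False if the button is inactive."""
        if button not in self.active:
            return False                      # clicking an inactive button has no effect
        self.active.discard(button)           # the clicked button deactivates
        self.active.update(self.arrows.get(button, []))  # buttons it points to activate
        return True

# Hypothetical arrows, loosely following the sequence of steps in this paper.
guide = GuideMap(
    arrows={"Link: Explore": ["Multidimensional Scaling"],
            "Multidimensional Scaling": ["Goto: Model"]},
    initially_active=["Link: Explore"],
)
```

A guidemap in which a button becomes active only after several earlier steps have all been taken would need an extra bookkeeping step (counting completed predecessors), which is left out of this sketch.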

The  GuideMap  in  Figure  2  prompts  the  user  to  explore  the 
data.  Clicking  the  "Link:  Explore"  button  causes  the 
GuideMap  that  guides  users  through  a  data  exploration  to 
appear.  This  GuideMap  is  shown  in  Figure  3.  This 
GuideMap  has  three  sections:  First,  the  user  is  guided  to 
examine  the  datasheet,  list  the  variables,  and  list  the  obser¬ 
vations.  Note  that  only  these  three  buttons  are  highlighted, 
so  only  these  actions  can  take  place.  Once  all  three  of 
these  actions  occur,  the  next  three  buttons  are  highlighted, 
indicating  that  the  user  is  now  guided  to  visualize  the  data, 
to  get  a  data  report,  or  to  compute  summary  statistics. 
Finally,  when  the  user  has  used  these  three  buttons,  the 
“Return”  and  “Create  Data”  buttons  are  available,  permit¬ 
ting  the  user  to  return  to  the  previous  guidemap  (the  one 
shown  in  Figure  2,  but  with  the  highlighting  changed)  or 
to  create  new  data. 


Let  us  examine  the  data-visualization  step  in  greater  detail. 
Though  the  visualization  of  the  multidimensional  scaling 
solution  space  is  more  informative  than  the  visualization 
of  the  data,  the  data-visualization  step  often  results  in 
interesting  revelations.  Figure  4  shows  the  ViSta-MDS 
data-visualization  screen.  The  ratings  from  each  judge  are 
plotted  against  the  ratings  of  every  other  judge  in  the  Scat- 
terplot  Matrix  at  the  upper  left  of  the  screen.  The  Scatter- 
plot  Matrix  serves  as  a  control  panel  for  the  visualization 
in  the  other  plots;  clicking  on  any  cell  of  the  Scatterplot 
Matrix  causes  a  larger  version  of  that  cell  to  appear  in  the 
Scatterplot  at  the  lower  left.  Clicking  on  any  two  cells  in 
the  same  row  or  column  of  the  Scatterplot  Matrix  causes 
the  three  dimensions  common  to  the  two  cells  to  appear  in 
the  SpinPlot  at  the  upper  center  of  the  screen.  Finally,  the 
Histogram  at  the  bottom  center  of  the  screen  reflects  the 
ratings  provided  by  the  judge  represented  by  the  row  of 
the  currently  selected  cell  of  the  Scatterplot  Matrix.  By 
examining  the  Histogram,  the  user  can  determine  whether 
a  particular  judge  is  inclined  to  give  extreme,  possibly 
biased,  dissimilarity  ratings.  By  examining  the  higher¬ 
dimensional  plots,  the  user  may  better  understand  the 
degree  to  which  judges  agree  with  each  other. 


After  the  data  are  visualized,  the  GuideMap  shown  in  Fig¬ 
ure  3  prompts  the  user  to  Report  Data  and  Summarize 
Data.  A  click  on  the  Report  Data  icon  produces  a  text 
FIGURE 4. The spreadplot provides information about each judge in the data set. Judges' ratings may be viewed
individually or may be compared with other judges' ratings.


screen  showing  each  matrix  in  the  data  set,  labelled  by 
judge,  and  the  stimuli  on  which  the  ratings  were  given.  A 
click  on  the  Summarize  Data  icon  produces  standard  sum¬ 
mary  statistics  such  as  mean,  variance,  skewness,  kurtosis, 
range,  and  quartiles  for  each  matrix  in  the  data  set.  These 
summary statistics provide the user with some knowledge
as  to  the  rating  style  of  each  judge.  By  presenting  both 
graphical  and  textual  displays  of  this  information,  ViSta- 
MDS  facilitates  understanding  of  multidimensional  scal¬ 
ing  data  by  a  wide  range  of  users. 


After  visualizing,  summarizing  and  reporting  the  data,  the 
GuideMap  for  exploring  data  looks  like  the  one  shown  in 
Figure 5. The highlighted buttons now guide the user to
either  return  to  the  GuideMap  which  led  to  this  one  or  to 
create  data  from  a  subset  of  the  current  data.  By  choosing 
the  "Return"  option,  the  user  returns  to  the  GuideMap 
shown  in  Figure  6.  This  GuideMap  guides  the  user  to  per¬ 
form  a  multidimensional  scaling  analysis.  It  is  important 
to  realize  that  the  expert  author  of  the  GuideMap  has 
already  selected  desirable  options  for  a  multidimensional 
scaling  analysis;  thus,  when  a  novice  clicks  on  the  Multi¬ 
dimensional  Scaling  button,  the  computations  are  carried 
FIGURE 5. The GuideMap for exploring data after the
data have been explored.


out  with  the  set  of  options  provided  by  that  expert.  The 
novice  effectively  places  the  decisions  about  the  options  in 
the  analysis  in  the  hands  of  the  expert  author  of  the 
GuideMap.  When  the  analysis  is  complete,  the  highlight¬ 
ing  of  the  GuideMap  in  Figure  6  changes  so  that  the  user  is 
prompted to “Goto: Model”. When this button is clicked,
the  GuideMap  for  modeling  data  (Figure  7)  appears. 


FIGURE  6.  The  basic  data  analysis  guidemap,  after  the 
data  are  explored,  now  guides  the  user  to 
perform  the  Multidimensional  Scaling. 




FIGURE 7. The model-exploration GuideMap prompts
the user to explore the multidimensional
scaling model both graphically and textually.

In  this  GuideMap,  the  user  is  first  required  to  save  the  cur¬ 
rent  model;  next,  the  Interpret  Model  button  must  be 
clicked.  This  produces  a  text  window  that  contains  a 
description  of  the  various  components  of  the  multidimen¬ 
sional  scaling  model,  and  a  brief  summary  of  the  best  way 
to  examine  and  interpret  those  components.  In  order  to 
examine  the  components  of  the  multidimensional  scaling 
model  described  in  the  Interpret  Model  screen,  the  user 
must  use  the  Visualize  Model  and  Report  Model  buttons. 

The  Visualize  Model  button  produces  the  spreadplot 
shown  in  Figure  8.  The  Scatterplot  Matrix  shows  each  of 
the  dimensions  of  the  solution  space  plotted  against  every 
other  dimension  of  the  space.  The  Scatterplot  Matrix  is  the 
control  panel  for  the  Stimulus  Plane  and  Stimulus  Space 
plots  in  a  manner  analogous  to  that  of  the  Scatterplot 
Matrix, Scatterplot and SpinPlot in the data-visualization
screen.  The  visualization  of  the  model  includes  a  scree 
plot,  showing  the  variance  accounted  for  by  each  dimen¬ 
sion  in  the  multidimensional  scaling  solution  space,  with  a 
vertical  line  indicating  the  dimensionality  of  the  current 
model.  The  Stress  Plot  at  the  lower  right  of  the  screen 
shows  the  value  of  the  stress  index  for  the  current  model. 
This  index  may  be  optimized  by  clicking  the  "Iterate"  but¬ 
ton  in  the  Stimulus  Space  plot,  as  described  in  McFarlane 
and  Young  (1994). 

The  Report  Model  icon  produces  a  text  screen  that  pro¬ 
vides  information  about  the  multidimensional  scaling 
model.  The  analyzed  matrix  is  shown,  along  with  the 
additive  constant  required  to  make  that  matrix  positive- 
definite.  The  initial  stimulus  coordinates  are  shown,  fol¬ 
lowed  by  the  current  stimulus  coordinates,  which  reflect 
changes made by either iteration or visual sensitivity analysis
(McFarlane and Young, 1994). The current value of
the  stress  index  is  also  reported.  Again,  ViSta-MDS  pro¬ 
vides  both  graphical  and  textual  displays  of  information  to 
enhance  the  understanding  of  users  at  all  levels  of  exper¬ 
tise. 


3.0  Conclusion 


ViSta-MDS  is  a  testbed  for  visual  statistical  analysis  that 
is  still  under  development.  The  goal  of  the  software  is  to 
provide  novice,  sophisticated  and  expert  users  the  neces¬ 
sary  guidance  to  perform  appropriate  statistical  analyses. 
This  goal  is  reached  through  the  use  of  GuideMaps  and 
WorkMaps  that  provide  both  graphical  and  textual  dis¬ 
plays  of  information.  ViSta-MDS  also  facilitates  visual 
sensitivity  analysis  of  the  multidimensional  solution 
space,  as  described  in  McFarlane  and  Young  (1994).  It  is 
hoped  that  the  guidance  and  interactive  graphical  capabili¬ 
ties  provided  by  ViSta-MDS  will  lead  to  the  enhanced 
understanding  of  multidimensional  scaling  analysis  by  a 
variety  of  users. 
Abstract: In this paper we describe a new statistical environment
for correspondence analysis which incorporates
the traditional analysis methods with dynamic graphical
procedures.  We  make  use  of  algebraically  linked  plots  to 
visualize  the  solution  space  and  the  quality  of  representa¬ 
tion  under  various  dimensions.  We  also  introduce  interac¬ 
tive  graphical  modeling  as  a  complementary  tool  to  the 
traditional  algebraic  analysis,  which  allows  the  data  analyst 
to  modify  the  configuration  of  points  and  to  examine  the  re¬ 
sultant  effect. 


1  Introduction 


Correspondence  analysis  has  been  used  primarily  to  ana¬ 
lyze  two-way  contingency  tables,  in  which  the  observed 
associations  of  two  categorical  variables  are  summarized  by 
the  cell  frequencies.  The  name  is  a  translation  of  the 
French Analyse des Correspondances, where the term
correspondances  denotes  a  “system  of  associations” 
between  the  elements  of  the  data. 

In  essence,  correspondence  analysis  performs  a  form  of  per¬ 
ceptual  mapping  similar  to  multidimensional  scaling, 
where  the  categories  are  represented  as  a  set  of  row  and 
column points in the multidimensional space, and proximity
indicates the level of association among the row or column
categories. The objective is to represent the inter-point
distances in a lower-dimensional subspace, chosen so that the
original distances are preserved as much as possible, for
ease of visualization.


Forrest  W.  Young 
Psychometric  Laboratory 
University  of  North  Carolina 
Chapel  Hill,  NC,USA 


To  illustrate  correspondence  analysis,  consider  the  multi¬ 
dimensional  time  series  on  the  number  of  science  doctor¬ 
ates  conferred  in  the  USA  from  1960  to  1975  that  is  shown 
in  Table  1  (Greenacre,  1984).  Correspondence  analysis  of 
these data yields the graphical display shown in Figure 1.

In Figure 1, there are two sets of points, as indicated by the
two types of point symbols. The points are row points for
the  12  disciplines  (represented  by  crosses)  and  column 
points  for  the  8  years  (represented  by  disks).  Distances 
between  points  within  the  same  set  (row-to-row  and 
column-to-column) are defined in terms of chi-square distances,
which can be interpreted as a measure of dissimilarity
between the frequency profiles. For example, the anthropology
degree and the engineering degree are far from each
other  because  their  profiles  are  different,  whereas  the 
mathematics  degree  is  near  the  engineering  degree  because 
their  profiles  are  similar.  On  the  other  hand,  distances  be¬ 
tween  points  of  different  sets  (row-to-column)  do  not 
approximate  any  defined  quantity  and  are  not  directly  com¬ 
parable.  The  interpretation  of  such  distances  is  governed  by 
the barycentric relationship between the rows and columns
(Greenacre  and  Hastie,  1987).  In  this  example,  each  disci¬ 
pline  point  lies  in  the  neighbourhood  of  the  year  in  which 
the  discipline’s  profile  is  prominent.  Thus,  there  are  rela¬ 
tively  more  chemistry  and  agriculture  degrees  in  1960, 
while  the  trend  from  1965  to  1975  appears  to  be  away  from 
the  physical  sciences. 
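
To make the notion of chi-square distance between profiles concrete, here is a minimal sketch using the usual definition (squared profile differences weighted by the inverse column masses). The function name and the small table of counts are invented for illustration; they are not the Science Doctorates data.

```python
import math

def chi_square_distance(table, i, k):
    """Chi-square distance between the profiles of rows i and k of a
    contingency table given as a list of rows of counts."""
    total = sum(sum(row) for row in table)
    n_cols = len(table[0])
    # Column masses: column totals divided by the grand total.
    col_mass = [sum(row[j] for row in table) / total for j in range(n_cols)]
    # Row profiles: each row divided by its own total.
    prof_i = [x / sum(table[i]) for x in table[i]]
    prof_k = [x / sum(table[k]) for x in table[k]]
    return math.sqrt(sum((a - b) ** 2 / c
                         for a, b, c in zip(prof_i, prof_k, col_mass)))

# Hypothetical counts: rows 0 and 1 have proportional frequencies, row 2 does not.
counts = [[10, 20, 30],
          [20, 40, 60],
          [30, 20, 10]]
print(chi_square_distance(counts, 0, 1))  # identical profiles
print(chi_square_distance(counts, 0, 2))  # different profiles
```

Rows with proportional counts have the same profile, so the distance between them is zero regardless of how many observations each row contains.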

A  new  statistical  environment  for  correspondence  analysis 
has  been  created  in  ViSta  (Young,  1994),  called  ViSta-CA, 
which  incorporates  the  traditional  analysis  methods  of 
correspondence analysis with graphical procedures. Results
of  correspondence  analysis  are  presented  visually  via  dy¬ 
namic  statistical  graphics,  the  purpose  being  to  help  the 
analyst  visually  explore  the  structure  of  the  geometric 
model. 


2 Algorithm

Let $X$ be an $(n \times m)$ matrix of observed frequencies of rank
$q$ such that the row sums and column sums are nonzero.
Let $\mathbf{1}$ be a column vector of ones and $I$ be an identity matrix,
each of appropriate order. Denote a matrix-valued function
that creates a diagonal matrix from a vector by $\mathrm{diag}(\cdot)$.
Define

i. $j = \mathbf{1}'X\mathbf{1}$ as the sum of all elements in $X$;

ii. $P = \frac{1}{j}X$ as the matrix of relative frequencies;

iii. $r = P\mathbf{1}$ as the vector of row masses;

iv. $c = P'\mathbf{1}$ as the vector of column masses;

v. $D_r = \mathrm{diag}(r)$ as a diagonal matrix of row masses; and

vi. $D_c = \mathrm{diag}(c)$ as a diagonal matrix of column masses.

The  generalized  singular  value  decomposition  (abbreviated 
SVD)  of  P  provides  the  required  solution  to  the  point  co¬ 
ordinates  of  correspondence  analysis: 

$$P = A D_\mu B'$$

where

i. $A$ is an $(n \times q)$ matrix whose columns are the left generalized
singular vectors;

ii. $D_\mu$ is a $(q \times q)$ diagonal matrix of generalized singular
values;

iii. $B$ is an $(m \times q)$ matrix whose columns are the right
generalized singular vectors; and

iv. $A'D_r^{-1}A = B'D_c^{-1}B = I$.

There  is  a  trivial  part  of  the  generalized  SVD  of  P  consist¬ 
ing  of  a  singular  value  of  1  and  the  associated  left  and  right 
singular  vectors  which  is  discarded  before  any  results  are 
displayed. The remaining left and right singular vectors define
the orthogonal principal axes of the column points and
row  points  respectively.  In  practice,  the  generalized  SVD  is 
computed  indirectly  by  performing  an  ordinary  SVD,  where 
the  ordinary  SVD  of  any  matrix  Q  is  given  by 

$$Q = U D_\alpha V'$$

under the constraint $U'U = V'V = I$. Thus, to compute the
generalized SVD of $P$, we perform the following steps:

i. Let $Q = D_r^{-1/2} P D_c^{-1/2}$.

ii. Obtain the ordinary SVD of $Q$, giving $Q = U D_\alpha V'$.

iii. Let $A = D_r^{1/2}U$, $B = D_c^{1/2}V$, and $D_\mu = D_\alpha$.

iv. Then $P = A D_\mu B'$ is the required generalized SVD.
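
The four steps above can be sketched with NumPy as follows. This is a minimal illustration rather than the ViSta-CA code: the function name and the table of counts are made up, and the sketch keeps min(n, m) dimensions instead of checking the rank q.

```python
import numpy as np

def generalized_svd(X):
    """Generalized SVD P = A D_mu B' with A' D_r^{-1} A = B' D_c^{-1} B = I."""
    X = np.asarray(X, dtype=float)
    P = X / X.sum()                        # matrix of relative frequencies
    r = P.sum(axis=1)                      # row masses, r = P 1
    c = P.sum(axis=0)                      # column masses, c = P' 1
    Q = P / np.sqrt(np.outer(r, c))        # step i: Q = D_r^{-1/2} P D_c^{-1/2}
    U, alpha, Vt = np.linalg.svd(Q, full_matrices=False)  # step ii
    A = np.sqrt(r)[:, None] * U            # step iii: A = D_r^{1/2} U
    B = np.sqrt(c)[:, None] * Vt.T         #           B = D_c^{1/2} V
    return P, r, c, A, alpha, B            # D_mu = D_alpha, returned as a vector

# Hypothetical 4 x 3 table of counts.
X = [[20, 10, 5],
     [10, 25, 10],
     [5, 10, 30],
     [8, 8, 8]]
P, r, c, A, mu, B = generalized_svd(X)
print(np.allclose(A @ np.diag(mu) @ B.T, P))   # step iv: P is reproduced
print(mu[0])                                    # the trivial singular value
```

The first singular value returned is the trivial one equal to 1; the remaining columns of `A` and `B` are the ones used for display.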


Table  1  Science  Doctorates  in  the  USA,  1960-1975 
[Plot: row points (+) for Engineering, Mathematics, Others, Physics, Sociology, Biology, Anthropology, Psychology, Economics, Earth Sciences, Chemistry and Agriculture; column points (•) for 1960, 1965 and 1970 to 1975; horizontal axis: Dimension 1.]


Figure  1  Correspondence  Analysis  of  Science  Doctorates  Data 


The row coordinates $F$ and column coordinates $G$ are then
computed according to the appropriate selection of the formulas
in Table 2. Greenacre (1984) introduced the terms
“principal” and “standard” coordinates to distinguish
between the two most common normalizations in the literature.
Standard coordinates are the coordinates $D_r^{-1}A$ or $D_c^{-1}B$
having unit normalization, while principal coordinates are
the coordinates $D_r^{-1}AD_\mu$ or $D_c^{-1}BD_\mu$ having weighted sums
of squares equal to the squared singular values:

$$F'D_rF = D_\mu^2, \qquad G'D_cG = D_\mu^2.$$

The joint plot of the rows and columns in $k$ dimensions
is obtained from the first $k$


columns  of  the  matrices  F  and  G.  A  symmetric  plot  dis¬ 
plays  both  the  row  points  and  column  points  in  principal 
coordinates,  whereas  an  asymmetric  plot  displays  one  set  of 
points  in  principal  coordinates  and  the  other  set  of  points  in 
standard  coordinates. 

The  squared  singular  values,  or  “principal  inertias”,  quan¬ 
tify  the  amount  of  variation  accounted  for  by  the  corre¬ 
sponding  principal  axes.  If  a  large  percentage  of  the  total 
inertia  lies  along  the  k  principal  axes,  it  means  that  the  chi- 
square  distances  among  row  profiles  and  among  column 
profiles  are  well  represented  along  these  axes.  Note  that  in 
an  asymmetric  plot,  the  principal  inertias  refer  only  to  the 
set  of  points  displayed  in  principal  coordinates. 


Table  2  Formulas  for  Coordinates 


Analysis Options           Row Coordinates           Column Coordinates
Analyze Row Profiles       $F = D_r^{-1}AD_\mu$      $G = D_c^{-1}B$
Analyze Column Profiles    $F = D_r^{-1}A$           $G = D_c^{-1}BD_\mu$
Analyze Both               $F = D_r^{-1}AD_\mu$      $G = D_c^{-1}BD_\mu$
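
The three analysis options in Table 2, and the principal inertias discussed above, might be computed along the following lines. The function name, the option keywords ("rows", "columns", "both") and the table of counts are assumptions made for this sketch only.

```python
import numpy as np

def ca_coordinates(X, option="both"):
    """Row coordinates F and column coordinates G for one analysis option.
    option: "rows" (Analyze Row Profiles), "columns" (Analyze Column
    Profiles) or "both" (Analyze Both)."""
    X = np.asarray(X, dtype=float)
    P = X / X.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    U, mu, Vt = np.linalg.svd(P / np.sqrt(np.outer(r, c)), full_matrices=False)
    # Discard the trivial dimension (singular value 1).
    U, mu, V = U[:, 1:], mu[1:], Vt.T[:, 1:]
    F = U / np.sqrt(r)[:, None]        # standard coordinates D_r^{-1} A
    G = V / np.sqrt(c)[:, None]        # standard coordinates D_c^{-1} B
    if option in ("rows", "both"):
        F = F * mu                     # principal coordinates D_r^{-1} A D_mu
    if option in ("columns", "both"):
        G = G * mu                     # principal coordinates D_c^{-1} B D_mu
    return F, G, mu, r, c

# Hypothetical 4 x 3 table of counts.
X = [[20, 10, 5],
     [10, 25, 10],
     [5, 10, 30],
     [8, 8, 8]]
F, G, mu, r, c = ca_coordinates(X, option="both")
share = mu**2 / np.sum(mu**2)          # share of total inertia per principal axis
print(np.round(share, 3))
```

Plotting the first two columns of `F` and `G` together would give a symmetric joint plot; choosing "rows" or "columns" instead gives the two asymmetric plots.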
3 Statistical Visualization

The  statistical  visualization  of  correspondence  analysis  in 
ViSta-CA presents the results of the analysis in a group of
interacting plots, called a spreadplot (Young, 1994), which is
based on the notion of a “graphical spreadsheet”. The individual
plots can be thought of as “cells” in the spreadplot
that  can  communicate  with  other  cells  via  equations  that 
define  their  relationships.  Figure  2  shows  the  spreadplot  for 
correspondence  analysis  of  the  Science  Doctorates  data. 

The  Spinplot  is  a  plot  of  the  row  and  column  points  in  the 
first  three  of  the  dimensions  selected  in  the  Dimensions 
window. The mouse can be in one of three modes: Spinning,
Brushing, and Selecting. The default mouse mode is
Spinning.  In  this  mode,  the  cursor  looks  like  a  hand.  Hold¬ 
ing  the  mouse  button  down  and  moving  the  cursor  around 
the  plot  causes  the  plot  to  rotate.  If  you  first  hold  the  shift 
key  down,  then  the  plot  will  continue  to  rotate  when  you  let 
up  on  the  mouse  button.  You  can  also  make  the  plot  rotate 


by using the Pitch, Roll, and Yaw buttons at the bottom.
When you place the mouse mode in Brushing, the cursor
looks like a tiny paint brush with a rectangle attached to it.
Moving the brush across the plot selects the points in the
rectangle and highlights these points. When the mouse
mode is changed to Selecting, the cursor looks like an arrow
and any points that are clicked on will be selected and
highlighted. In addition, if the cursor is dragged across an
area, any points inside the area are also selected and highlighted.
Labels of selected points will be shown in whatever
plots are linked to the Spinplot. With the Spinplot, the
analyst can search the various three-dimensional perspectives
for those views that display interesting
structure of the geometric model.

The  Scatterplot  plots  the  first  two  dimensions  that  are  se¬ 
lected  in  the  Dimensions  window.  This  plot  has  two 
mouse  modes — Brushing  and  Selecting — which  are  the 
same  as  those  modes  for  the  Spinplot.  The  information  in 
the  Scatterplot  was  displayed  in  Figure  1. 


Figure  2  Visualization  Spreadplot  for  Correspondence  Analysis 
The Rows & Columns window, which contains the
labels for the row and column points, is useful for locating
or identifying points in the Spinplot, Scatterplot, and
Residual Plot windows. Since each cell frequency corresponds
to the intersection of a row and a column in a contingency
table, when more than two labels are selected or
when the two labels belong to the same way of the table, the
points in the Residual Plot window will not respond to
the selection.

The  Residual  Plot  is  a  plot  of  the  residuals  versus  the 
centered observed frequencies. The centered data are calculated
by the formula $P - rc'$. The reconstitution of the
correspondence matrix $P$ based on the rank $k$ weighted
least squares approximation is given by the formula

$$\hat{P} = A_{[k]} D_{\mu[k]} B_{[k]}'$$

where the subscripts $[k]$ refer to the fact that only $k$ of the
dimensions are involved in the calculation. The specific
columns of $A_{[k]}$ and $B_{[k]}$ that are involved correspond to the
dimensions selected in the Dimensions window, which
are not necessarily the first $k$ singular vectors. The specific
diagonal elements of $D_{\mu[k]}$ are the associated singular
values. The residual matrix is given by

$$(P - rc') - \hat{P}.$$

The  Residual  Plot  can  be  used  for  diagnostic  checking  as 
in  a  regression  analysis. 
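
The quantities plotted in the Residual Plot could be computed as in the sketch below. It assumes that dimension numbers start at 1 and exclude the trivial dimension; the function name and the counts are hypothetical.

```python
import numpy as np

def ca_residuals(X, dims):
    """Centered data P - rc' and residuals (P - rc') - P_hat, where P_hat is
    rebuilt from the selected nontrivial dimensions `dims` (1-based)."""
    X = np.asarray(X, dtype=float)
    P = X / X.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    U, mu, Vt = np.linalg.svd(P / np.sqrt(np.outer(r, c)), full_matrices=False)
    # Generalized singular vectors and values with the trivial dimension removed.
    A = np.sqrt(r)[:, None] * U[:, 1:]
    B = np.sqrt(c)[:, None] * Vt.T[:, 1:]
    mu = mu[1:]
    idx = [d - 1 for d in dims]                   # selected dimensions, 0-based
    P_hat = (A[:, idx] * mu[idx]) @ B[:, idx].T   # A_[k] D_mu[k] B_[k]'
    centered = P - np.outer(r, c)
    return centered, centered - P_hat

# Hypothetical 4 x 3 table of counts (two nontrivial dimensions).
X = [[20, 10, 5],
     [10, 25, 10],
     [5, 10, 30],
     [8, 8, 8]]
centered, resid_one = ca_residuals(X, dims=[1])
_, resid_all = ca_residuals(X, dims=[1, 2])
print(np.abs(resid_one).max(), np.abs(resid_all).max())
```

Selecting every nontrivial dimension reproduces the centered data, so the residuals shrink to rounding error; a scatter of `resid_one.ravel()` against `centered.ravel()` corresponds to the plot described above.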

The  Fit  Plot  is  a  plot  of  the  principal  inertias  against  each 
dimension,  showing  the  relative  amount  of  fit  for  each  di¬ 
mension  of  the  analysis.  It  serves  the  same  purpose  as  the 
scree  plot  in  principal  component  analysis. 

The Dimensions window contains a list of dimensions. It
serves as a control panel for the visualizations in the Spinplot,
Scatterplot, and Residual Plot windows. Selecting
at least two dimensions will change the current display
of the row and column points in the Scatterplot window
to that formed by the first two selected dimensions. For example,
shift-clicking Dimension 2, Dimension 3, and
Dimension 5 produces a display of the points in the second
and third dimensions. Selecting three or more dimensions
will change the display in both the Spinplot and
Scatterplot. In addition, selections in the Dimensions
window are tantamount to a re-specification of the dimensionality
of analysis: the $k$ selected dimensions will determine
the $k$ singular vectors from the matrices $A$ and $B$ and
the associated singular values from the diagonal matrix $D_\mu$
that are to be used to calculate $\hat{P}$ and the associated
residuals. The residuals will be updated and re-plotted in
the Residual Plot window to reflect the change in fit.

When the visualizations provided by the spreadplot are combined
with the traditional reporting technique, which is
also  available  in  ViSta-CA,  the  analyst  gains  a  greater 
understanding  of  the  results  of  correspondence  analysis 
than  when  either  technique  is  used  alone. 


4 Statistical Re-Vision


Statistical  re-vision  is  a  set  of  statistical  visualization  tools 
that  is  used  to  help  the  analyst  search  for  meaningful  and 
parsimonious  model  parameterizations.  In  ViSta-CA,  the 
analyst  is  able  to  move  row  or  column  points  to  new  loca¬ 
tions  which  may  be  more  “interpretable”,  but  which  no 
longer  satisfy  all  of  the  geometric  properties  of  correspon¬ 
dence  analysis. 

When  the  analyst  moves  a  point,  the  software  responds  by 
adjusting  the  positions  of  the  other  points  so  that  they 
approximate the correspondence analysis equations as well
as possible. For example, when a column point is moved by
the  analyst,  the  software  calculates  new  positions  for  the 
row  points. 

The calculation of the new positions of the “other” set of
points is done so that the basic relationship $P = A D_\mu B'$ is
maintained. This is done by noting that

$$P = A D_\mu B' = D_r F G' D_c$$

when the normalization is asymmetric (Analyze Row Profiles
or Analyze Column Profiles). However, the relationship
specified by the equation does not hold when the
normalization is symmetric (Analyze Both), which is why
point-moving is not possible in that case.
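
The distinction between the asymmetric and symmetric cases can be checked by substituting the coordinate formulas of Table 2 into $D_r F G' D_c$. The short derivation below is our own working, written with $D_\mu$ for the diagonal matrix of generalized singular values:

```latex
\begin{align*}
\text{Analyze Row Profiles:}\quad
  D_r F G' D_c &= D_r \,(D_r^{-1} A D_\mu)\,(D_c^{-1} B)'\, D_c
               = A D_\mu B' = P, \\
\text{Analyze Column Profiles:}\quad
  D_r F G' D_c &= D_r \,(D_r^{-1} A)\,(D_c^{-1} B D_\mu)'\, D_c
               = A D_\mu B' = P, \\
\text{Analyze Both:}\quad
  D_r F G' D_c &= D_r \,(D_r^{-1} A D_\mu)\,(D_c^{-1} B D_\mu)'\, D_c
               = A D_\mu^{2} B' .
\end{align*}
```

In the symmetric case the singular values enter twice, so the product equals $A D_\mu^2 B'$, which in general differs from $P$.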

Understanding  of  the  statistical  re-vision  technique  may  be 
enhanced  through  the  use  of  examples.  To  this  end,  con¬ 
sider  the  spreadplot  for  the  correspondence  analysis  of  the 
Science Doctorates data shown in Figure 3. In the Spinplot
and Scatterplot windows, the column points are displayed
in principal coordinates; the row points, which are
represented in standard coordinates, are masked using the
Hide Row Points menu item in the Scatterplot menu.
When normalization is asymmetric, the Scatterplot window
supports an additional mouse mode, Point-Moving. In
this mode the cursor looks like a finger, with which the
analyst can move a column point by clicking on the point
and dragging it to a new location.
The two-dimensional correspondence plot in the Scatterplot
window, which accounts for approximately 95% of the
total inertia, is almost an exact display of the column profiles.
The spread of the column points along the first axis,
Dimension 1, indicates a deterministic trend, whereas
the second axis, Dimension 2, is difficult to interpret. To
facilitate interpretation, the analyst may decide to move the
1960 year point in a way such that the distance between
1960 and 1965 is approximately equal to the distance
between 1965 and 1970, to reflect the five-year gap (note
that the other points are separated by a one-year interval).

When the year point 1975 is moved, the column coordinates
G are changed to, say, \tilde{G}. We must calculate a new set
of row coordinates \tilde{F} such that

P = D_r \tilde{F} \tilde{G}' D_c.

Note that \tilde{F} \tilde{G}' = D_r^{-1} P D_c^{-1}.

We solve for \tilde{F} by the equation

\tilde{F} = D_r^{-1} P D_c^{-1} [\tilde{G} (\tilde{G}'\tilde{G})^{-1}].

While the basic relation P = D_r \tilde{F} \tilde{G}' D_c is maintained, the
orthogonality constraint of correspondence analysis may be
violated, since the left singular vectors are related to the row
coordinates through the equation A = D_r F. The “principal
axes” defined by the new set of “left singular vectors”

\tilde{A} = D_r \tilde{F}

may no longer satisfy the orthogonality constraint
\tilde{A}' D_r^{-1} \tilde{A} = I.

The new row coordinates are displayed in both the Spinplot
and Scatterplot if the row points are not masked.
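
The row-coordinate update described above can be sketched numerically. This is a minimal illustration under stated assumptions, not the ViSta-CA implementation (which is not shown in the paper): D_r and D_c are taken to be the diagonal matrices of row and column masses of P, and the function name `update_row_coordinates` and the random test data are hypothetical.

```python
import numpy as np

def update_row_coordinates(P, G_new):
    """Least-squares row coordinates for moved column coordinates G_new.

    Computes F = D_r^{-1} P D_c^{-1} G (G'G)^{-1}, assuming D_r and D_c
    hold the row and column masses of P (an assumption of this sketch).
    """
    P = np.asarray(P, dtype=float)
    G_new = np.asarray(G_new, dtype=float)
    r = P.sum(axis=1)                    # row masses (diagonal of D_r)
    c = P.sum(axis=0)                    # column masses (diagonal of D_c)
    S = P / np.outer(r, c)               # D_r^{-1} P D_c^{-1}
    # Solve (G'G) F' = G' S' instead of forming the inverse explicitly.
    F_new = np.linalg.solve(G_new.T @ G_new, G_new.T @ S.T).T
    return F_new, r, c

rng = np.random.default_rng(0)
P = rng.random((5, 3))
P /= P.sum()                             # correspondence matrix sums to 1
G_moved = rng.normal(size=(3, 3))        # hypothetical moved column coordinates
F_new, r, c = update_row_coordinates(P, G_moved)

# With a full set of dimensions the relation P = D_r F G' D_c is recovered.
P_rebuilt = np.diag(r) @ F_new @ G_moved.T @ np.diag(c)
# Orthogonality check: with A = D_r F, A' D_r^{-1} A reduces to F' D_r F.
gram = F_new.T @ np.diag(r) @ F_new
```

Comparing `gram` with the identity matrix shows how far the moved configuration departs from the orthogonality constraint; after an arbitrary move it will generally not be the identity.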


Figure 3. Correspondence Analysis of Science Doctorates Data: Asymmetric Normalization With
Column Points In Principal Coordinates.

The residuals are re-calculated using the new values in F
and G to update the Residual Plot. To obtain an approximate
measure of the quality of fit after point moving, we
calculate a new set of “inertias” by the equation

d„  =  (a'a)‘auf 

and plot the squared diagonal entries against each dimension
as a dashed line in the Fit Plot window. Note that
since the orthogonality constraint has been violated, the
squared diagonal entries of D_\mu will overestimate the true
inertias.

The results of moving the 1960 column point are presented
visually in Figure 4. Notice that in the Fit Plot window,
the inertia along the second axis decreased, reflecting the
fact that the variation of the column points in the second
dimension has been reduced. In addition, the magnitude of
the residuals in the Residual Plot has increased.


5  Conclusion

ViSta-CA is a widely applicable tool for research involving
correspondence analysis. It features state-of-the-art statistical
visualization techniques for exploring the structure of
the geometric model. When this technique is combined
with the traditional reporting techniques, the analyst may
gain considerable insight into the multidimensional properties
of his data. A key feature of ViSta-CA is statistical
re-vision, which allows the analyst to explore for a model that
provides a better interpretation of the data than the one
provided by traditional algebraic analysis. The principle
behind this design is best summarized by a quotation from
Marriott (1974):

if the results disagree with informed opinion, do not admit
a simple logical interpretation and do not show up
clearly in a graphical presentation, they are probably
wrong. There is no magic about numerical methods...


Figure  4  Statistical  Re-Vision  For  Correspondence  Analysis 

Abstract


Logistic  regression  is  the  accepted  parametric  method  for 
analyzing  data  with  continuous  predictors  and  a  binary 
response.  As  with  general  linear  models  the  relation 
between  the  predictors  and  the  logit  of  the  response 
probability  is  assumed  linear.  When  the  observed  response 
is  continuous,  visual  techniques,  such  as  scatterplots,  are 
useful  in  ascertaining  the  nature  of  this  relation,  but 
scatterplots  offer  little  information  when  the  observed 
response  is  binary.  A  system  offering  visual  model 
exploration  techniques,  derived  specifically  for  binary 
response  data,  is  proposed. 

1.0  Introduction 


Many models used by social and biological statisticians fall
within the realm of generalized linear models. These models
consist of three components: the random component,
comprising the independent observations of the dependent
variable y; the systematic component, which is the
explanatory model \theta = \sum_i \beta_i x_i, where i indexes the
independent variables; and the link function f(E(y)) = \theta.
The simplest link is the identity link, f(E(y)) = E(y).

Correct application of generalized linear models includes the
assumptions of linearity and additivity. The linearity
assumption specifies that a straight line describes the
relation between x_i and \theta, that is, a unit change in x_i always
yields the same change in \theta. Additivity means that there are
no interactions; a change in any x_i results in the same
change in \theta independent of the values of all x_j, j ≠ i.

Violation of these assumptions will lead to model
misspecification. Incorrectly forcing a linear additive fit
where the association is nonlinear or nonadditive results in
incorrectly large error and systematically biased predicted
values. For example, if the true relation is quadratic, the
linear model may result in \beta_i = 0, indicating that there is no
association between x_i and y.

The methods for applying generalized linear models to
continuous response variables and one or more continuous
or categorical predictor variables are well understood,
especially linear models solved by the least squares normal
equations. The solution to these equations yields the
unbiased estimates for the equation E(y) = \alpha + \sum_i \beta_i x_i.

2.0  Binary  Responses

A  binary  response  presents  problems  for  the  general  linear 
model.  Because  the  response  is  categorical,  the  normal 
equations  will  not  yield  reasonable  solutions  for  the 
regression  of  y  on  x.  Since  the  analyst  will  likely  be  more 
interested  in  the  probability  of  a  response  conditional  on 
having  observed  x  than  the  specific  value  of  the  outcome, 
we consider p(x) = p(y = 1 | x), the probability of responding
1 given x, as the response variable of interest. Clearly, p(x)
is not suitable for use as a linear model response variable as
it is bounded by 0 and 1. A linking function is required to
transform p(x) to a variable that is continuous, unbounded
and may reasonably be expected to have a linear relation with
x_i. Logistic regression employs the logit link, that is,
\theta = log(p(x_i) / (1 - p(x_i))).
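
As a small illustration of why the logit link is used, the sketch below maps probabilities in (0, 1) onto the unbounded real line and back again. The function names `logit` and `inverse_logit` are chosen here for illustration and do not come from the paper.

```python
import math

def logit(p):
    """Logit link: map a probability in (0, 1) to the whole real line."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return math.log(p / (1.0 - p))

def inverse_logit(theta):
    """Inverse link: map a linear predictor back to a probability."""
    return 1.0 / (1.0 + math.exp(-theta))

# Probabilities near 0 or 1 map to large negative or positive values.
for p in (0.01, 0.25, 0.5, 0.75, 0.99):
    print(p, round(logit(p), 3))
```

Because the transformed scale is unbounded, a linear predictor can take any value without ever implying a probability outside (0, 1).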

3.0  Visualizing  GLM's

If one is unsure of the shape of the relation between y and x_i,
one may choose to construct a scatterplot. This allows
visual inspection of the relation and may suggest that a
transformation of the x_i variable is necessary to effect a
linear model. That is, the model E(y) = x_i + x_i^2 or E(y) =
log(x_i) may describe the relation better than would
E(y) = x_i.

3.1  Smoothing 

Another choice for the relation between x_i and y is the
function E(y) = S(x_i), where S is a nonparametric function,
such as a smoother. Here the predicted value of y is found
by a weighted average of the y's for observations that lie in
close proximity (in the x space) to the target observation.
Weighting schemes include simple means as well as linear
and polynomial smoothers (Cleveland, 1979; Cleveland,
Devlin and Grosse, 1988; Cleveland and Devlin, 1988; Fan,
1992).

As in the continuous response case, the analyst may wish to
visualize binary response data in conjunction with analysis.
Unfortunately, a scatterplot is of little utility when the
response variable is binary, as the plot will merely be rows
of points at 0 and 1. A solution to this problem is found in
smoothing (Copas, 1983), where the smoothed response
variable is p(x_i). Here smoothing is used not necessarily to
form a model but rather to visualize the shape of p(x_i) vs x_i.

If the logistic regression model appears to fit the smoothed
p(x_i), then we may choose that model. If not, then we may
wish to transform the x_i variable. Transformations of an x_i
variable may not be immediately suggested by the shape of
S(x_i). Since logistic regression assumes a linear relation
between x_i and logit(p(x_i)), observing the plot of
logit(S(x_i)) vs x_i may prove useful. This plot may be used
to choose some transformation of x_i, in much the same way
a scatterplot is used with a continuous response variable.
As in the continuous outcome case, the model may also be
defined by a smooth.
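
The idea of smoothing a 0/1 response and then inspecting its logit can be sketched as follows. This is only an illustration: the Gaussian kernel weights, the bandwidth, the simulated data, and the function names are assumptions made here, not the smoother used by the authors.

```python
import numpy as np

def smooth_binary(x, y, grid, bandwidth=0.1):
    """Gaussian-kernel weighted average of 0/1 responses at each grid point."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # weights[g, i]: weight of observation i when estimating at grid[g]
    weights = np.exp(-0.5 * ((grid[:, None] - x[None, :]) / bandwidth) ** 2)
    return (weights @ y) / weights.sum(axis=1)

def logit_of_smooth(s, eps=1e-6):
    """Logit of the smoothed probability, clipped away from 0 and 1."""
    s = np.clip(s, eps, 1.0 - eps)
    return np.log(s / (1.0 - s))

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=2000)
p_true = 1.0 / (1.0 + np.exp(-3.0 * x))     # simulated logistic relation
y = rng.binomial(1, p_true)

grid = np.linspace(-0.8, 0.8, 9)
s_hat = smooth_binary(x, y, grid, bandwidth=0.15)   # smoothed p(x)
logit_hat = logit_of_smooth(s_hat)                  # plot against grid to judge linearity
```

Plotting `logit_hat` against `grid` gives the diagnostic described above: an approximately straight line supports the logistic model, while curvature suggests transforming the predictor.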

4.0  Visualization  For  Binary  Response
Data  in  the  XLisp-Stat  Environment

XLisp-Stat  provides  an  ideal  environment  for  implementing 
the  ideas  discussed  for  visualizing  models  with  continuous 
predictors  and  binary  responses.  Smoothing  is  a 
computationally  intensive  procedure  that  requires 
visualization  for  a  true  appreciation  and  understanding  of  the 
result.  XLisp-Stat  offers  both  the  computational  efficiency 
and  high  resolution  graphics  to  effectively  smooth  binary 
response  models. 


The  proposed  system  provides  two  stages  of  data  analysis. 
At  stage  1  the  user  smoothes  the  data  using  generalized 
additive  model  methods  (Hastie  and  Tibshirani,  1989).  The 
resulting  smooth  is  then  inspected  visually.  Visual 
techniques  include: 

1)  Plots  of  both  the  smoothed  probability  and  smoothed 
logistic  surface  with  predicted  values. 

2)  Residual  plots. 

3)  The  marginal  smooth  for  each  independent  variable. 

Statistics  indicating  the  importance  of  each  variable  in  the 
model  are  also  given. 

After  inspecting  the  smooth,  the  user  may  go  to  stage  2, 
fitting  a  parametric  model.  The  visual  parametric  techniques 
include: 

1)  Biplots  with  a  vector  indicating  the  relative  magnitude  of 
the  effect  of  each  variable  in  the  model. 

2)  The  predicted  response  surface  with  predicted  values. 

3)  Residual  plots  for  each  independent  variable. 

4)  Influence  plots. 

Parameter  estimates  and  standard  errors  are  also  included. 

The  user  may,  at  any  time,  alter  the  X  matrix  by  adding  or 
dropping  variables  or  transforming  variables.  The  result  of 
adding  a  transformed  variable  will  be  seen  in  the  predicted 
response  surface. 

5.0  Example

Figure 1 shows the smooth for data generated by the model
logit(y) = 40*x1 + 0*x2 + 40*x3 + 0*x4. The upper left
plot is the function p = inverse logit [S(x1) + S(x4)]; the
upper middle plot is the function y = S(x1) + S(x2). Both
plots include predicted values for the full model for all
observations. At the lower middle is the residuals plot and
at the lower left is the single dimension plot of x4. All of
these plots are dynamic, as the variables viewed are
changeable. The upper right window contains observation
names while the lower right window is for statistics. The
statistics SSQ and %Total indicate the contribution of each
independent variable to the overall variance of the predicted
logits, but assume that the X matrix is orthogonal.


302  Visualized  Models  for  Binary  Response  Data 


Inspecting  the  various  plots  and  windows  indicates  that: 

1)  The  smooth  adequately  describes  the  data. 

2)  The  relation  between  each  x_i  and  the  response
logit(y)  is  linear.

3)  The  variables  x1  and  x3  are  salient  while  x2  and  x4
are  not.

The X matrix menu option may be used to remove x2 and
x4 from the X matrix, and a parametric model is fitted using
the Model menu option Parametric. The resulting model
is shown in Figure 2. The plots are, clockwise from upper
left, a biplot of independent variables with the parameter
vector added, a probability function plot with full model
predicted values, an influence plot and a residuals plot. All
plots are dynamic in that the variables viewed may be
changed. The observation window is as in the smooth
model and the statistics window contains statistics common
for a logistic regression analysis.

The marked point is an observed 1 that had a
predicted probability near 0; it has both a large residual
(lower left) and a large effect on the chi-square (lower right).
Had this been actual data, these findings could indicate that
a closer inspection of this observation was necessary.


Figure  1.  A  smooth  model 

Figure 2. A parametric model

Abstract


Principal components analysis is a well known statistical
model used to approximate a high dimensional data space by
a subspace of lower dimensionality. Like many multivariate
statistical procedures, when the principal components model
is fit to data based on purely algebraic criteria, it can be
plagued by problems of data sensitivity and interpretability.
An interactive technique called visual components analysis
is proposed as one solution to these difficulties. Visual
components analysis allows a user to visualize, evaluate, and
modify a principal components model within a unified
graphical environment. It is believed that visual components
analysis will yield subjectively more satisfying solutions
than solutions obtained from classical algebraic analyses.

1.0  Motivation 


The principal component model is a well-known statistical
model commonly used for reducing data dimensionality,
assessing linear relationships in data, or identifying the
“latent dimensions” presumed to underlie observed variables.
It is written:

X = U(\Lambda V) = UV'   (EQ1)

where

•  X is a matrix of n observations measured on m variables,

•  Columns of U are unit standardized components,

•  Matrix \Lambda is the diagonal matrix of eigenvalues, and

•  Columns of V' are component coefficients (parameters
of the principal component model) with crossproducts
equal to the eigenvalues.
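
A decomposition with these properties can be sketched with a singular value decomposition. The sketch below is an illustration under assumptions, not the author's code: “unit standardized” is read here as unit-length (orthonormal) columns of U, the singular values are absorbed into the coefficients, and the eigenvalues are those of X'X for column-centered X. The data are random and purely hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4))   # hypothetical data
X = X - X.mean(axis=0)                                    # column-center

# Thin SVD: X = U diag(s) W'
U, s, Wt = np.linalg.svd(X, full_matrices=False)
Vt = np.diag(s) @ Wt          # component coefficients V' (singular values absorbed)

eigenvalues = s ** 2          # eigenvalues of X'X
crossproducts = Vt @ Vt.T     # diagonal, with the eigenvalues on the diagonal
reconstruction = U @ Vt       # reproduces X
```

Under this reading, both the components (columns of U) and the coefficients (rows of V') are mutually orthogonal, which is the double orthogonality the principal component model is known for.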


Traditionally, the principal component model is fit through
algebraic equations that both reflect desired data
characteristics and possess appropriate analytic properties.
However, a variety of common data conditions can lead a
principal component model fit by algebraic methods astray.
Examples of such conditions include: data matrix ill-conditioning,
outliers, leverage points, and influential observations
(Barnett & Lewis, 1984; Jolliffe, 1986; Belsley, 1991; Critchley,
1985; Radhakrishnan & Kshirsagar, 1981).

Algebraic solutions to these difficulties have previously been
proposed. They include detection-based strategies (that is,
find the problematic variables, observations, or model
characteristics and eliminate them) as well as robust, resistant,
and local fitting methods. Unfortunately, the robust/resistant
estimators and local fitting methods with the most desirable
properties frequently suffer from the limitations that they are
computationally intensive, require iterative solutions, and contain
arbitrary constants¹ that substantially affect the solutions
they ultimately attain (Belsley, 1991; Cleveland, 1993;
Cleveland, Grosse & Shyu, 1991; Huber, 1981).

2.0  Statistical  Revision 


I propose a dynamic, user-interactive approach called
statistical re-vision as an additional solution to the problems
suffered by many ordinary algebraic statistical modeling
techniques. Statistical re-vision is a cyclic, iterative approach
to model fitting that utilizes the analyst as an active element
in the statistical estimation process. Although it begins with
algebraically optimal model parameter estimates, during the
course of statistical re-vision operations, a subjective,
aesthetic notion of parameter optimality is substituted for the
initial algebraic criteria. Specific characteristics of the
subjective criteria are determined by the analyst (Young,
Faldowski, & McFarlane, 1993).

1.  E.g., tuning constants or bandwidths.

Statistical re-vision is conducted in two phases. The first phase
is interactive graphical modeling. In it, the user graphically
modifies the estimated parameters of the model by moving a
representation of the model in a computer display. A new set
of subjectively adjusted parameter estimates (coefficients)
and predicted values for the model are produced as a result.
The second phase is interactive graphical exploration. Here,
the analyst explores the implications of his subjective
adjustment of estimated parameters in terms of fit, and he has
the option of further refining his choice of the subjective
parameter estimates.

When statistical re-vision is applied in the context of the
principal component model, I call the resulting modeling
process visual components analysis and the resulting set of
components, visual components. The interactive graphical
modeling phase of visual components analysis consists of
user modification of the initial component coefficients matrix
(V'), resulting in a set of subjective component coefficients
(V*'). Based on the new set of coefficients, a corresponding
set of subjective components U* is calculated.

The interactive graphical exploration phase in visual
components analysis consists of graphical exploration of the
components and coefficients derived from the altered parameter
estimates using structure and fit plots. It also entails
consideration of alternative sets of coefficients different from the
initial ones (V'), but not as extreme as those specified during
interactive graphical modeling (V*'). The interactive
graphical exploration phase of statistical re-vision is highly
dynamic, with plots of component structure and fit indices
continually updated throughout the exploration process.

Note that when statistical re-vision is used to adjust the
algebraically optimal parameters estimated from a set of data,
“subjective” fit increases, but objective fit virtually always
decreases. In addition, it is often necessary to violate primary
constraints of the model. For example, the principal
components model provides the only decomposition of a data
matrix that is orthogonal in both scores and coefficients.
During visual components analysis, one of these properties
must be sacrificed. Since characteristics of variable space
were assumed to be of primary interest, in the remaining
discussion the orthogonality of component scores was selected to
be maintained. In other applications, compelling arguments
might be made for choosing the alternative constraint.


3.0  A  System  for  Visual  Components
Analysis

Figure  2  shows  mock-ups  of  plots  from  a  statistical  graphics 
system  designed  to  support  visual  components  analysis.  It 
contains  two  general  types  of  plots: 

•  Structure  plots,  which  are  designed  to  show  the  structure 
of  the  data  and  model,  and 

•  Fit  plots,  which  are  designed  to  help  the  analyst  assess 
the  degree  to  which  a  component  model  objectively  fits 
the  data. 

Through the joint use of these plots during statistical
re-vision, the analyst attempts to balance the subjective quality
of the structure displayed in the structure plots against the
objective quantification of fit relayed by the fit plots. The
visual components system is designed to help the analyst
balance trade-offs between subjective and objective fit as he
attempts to optimize subjective characteristics of the
components solution.

The structure plots include the two “BiPlot” windows and the
“TourPlot” window, which present the structure of the data and
model as classic biplots (Gabriel, 1972). The “TourPlot” also
serves as a control center for the system. It manages which
space (data space, model space, error space, or
interactive-graphical-exploration space) is currently visible in the
structure plots. It controls whether the system is operating in
visualization or statistical re-vision mode. In addition, it
supports guided tours (Young, Kent & Kuhfeld, 1988; Buja,
Asimov, Hurley & McGill, 1988) between the spaces shown
in the “BiPlot” windows and provides graphical tools for use
in the two phases of statistical re-vision. The “BiPlot”
windows, meanwhile, control what variables are displayed in the
“TourPlot” and show the initial and target spaces for guided
tours presented in the “TourPlot”.

The “Scree Plot” is a standard display in principal
components analysis. It portrays the variances of the components
plotted against component number. The “Variable-Model
Variance Trace Plot” shows what percent of each variable’s
variance is accounted for by the current model components.
The “Variable-Component Variance Trace Plot” shows what
percent of each variable’s variance is accounted for by
specific components within the current components model.

4.0  Interactive  Graphical  Modeling 

Although interactive graphical modeling for visual
components analysis may be performed in either the model or data
spaces, for illustrative purposes I will describe it in the
model space. To begin interactive graphical modeling, the
analyst switches from visualization to re-vision mode in the
“TourPlot”. At that point, he gains access to the interactive
graphical modeling tools shown at the bottom of the
“TourPlot” window. The “Direct” and “Indirect” buttons describe
two ways of performing interactive graphical modeling.
Figure 1 shows the “TourPlot” window during “Direct”
interactive graphical modeling mode. In the left-hand panel,
note that the cursor has changed to a finger, which the analyst
used to “grab” one of the component vectors in the display.
He then orthogonally rotated the component vector to a new
location among stationary representations of the
observations and variables. This is portrayed in the right-hand panel
of Figure 1.

When the analyst finds a suitable new location for the
component vectors, he presses the “Compute” button. At this
point, the system translates the user’s graphical rotation into
an orthogonal transformation matrix, R, which is used to
define a new set of adjusted coefficients and components,
V*' and U*, respectively. The components model, modified
through interactive graphical modeling, may be written:

X = (UR)(R'V') = (U*)(V*')   (EQ2)

where

•  R equals the orthogonal rotation matrix,

•  U* is the new graphically altered set of components, and

•  V*' is the new graphically altered set of coefficients
(estimated model parameters).
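
The algebra of EQ2 can be checked with a short numerical sketch. The example is an assumption-laden illustration rather than the system's code: the initial components are taken from a thin SVD with orthonormal U, the user's graphical rotation is represented by a single plane rotation, and the helper `plane_rotation` and the 25 degree angle are hypothetical.

```python
import numpy as np

def plane_rotation(k, i, j, angle):
    """k x k orthogonal matrix rotating by `angle` radians in the (i, j) plane."""
    R = np.eye(k)
    c, s = np.cos(angle), np.sin(angle)
    R[i, i], R[i, j] = c, -s
    R[j, i], R[j, j] = s, c
    return R

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 3))
X = X - X.mean(axis=0)
U, s, Wt = np.linalg.svd(X, full_matrices=False)
Vt = np.diag(s) @ Wt                             # initial coefficients V'

R = plane_rotation(3, 0, 1, np.deg2rad(25.0))    # hypothetical user rotation
U_star = U @ R                                   # U* = U R
Vt_star = R.T @ Vt                               # V*' = R' V'

fit_unchanged = np.allclose(U_star @ Vt_star, X)         # X is still reproduced
scores_gram = U_star.T @ U_star                          # stays the identity
coef_gram = Vt_star @ Vt_star.T                          # no longer diagonal
```

Because R is orthogonal, the product U*V*' still reproduces X and the components remain orthogonal, while the rotated coefficients lose their orthogonality. This matches the choice, stated earlier, to maintain the orthogonality of component scores.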

The system now automatically enters the second phase of
visual components analysis, interactive graphical
exploration.

5.0  Interactive  Graphical  Exploration 

As the system enters the interactive graphical exploration
phase of visual components analysis, the information
displayed in the structure plots (“BiPlot1”, “BiPlot2”, and
“TourPlot” windows) changes. Regardless of what space the
interactive graphical modeling was performed in, during
interactive graphical exploration, all structure plots show
model spaces. The “BiPlot1” window displays the structure
determined from the initial set of component coefficients,
while the “BiPlot2” window displays the structure
determined from the graphically-altered component coefficients.
The “TourPlot”, meanwhile, shows the structure of a set of
components determined from a linear combination of the
initial and the graphically-altered coefficients.


It is convenient to think about the structure shown in the
“TourPlot” window during interactive graphical exploration
as formed by conducting a guided tour between the
components represented in the “BiPlot1” window and the
corresponding components in the “BiPlot2” window. Each step in
the guided tour (a trigonometric interpolation between the
model spaces shown in the “BiPlot1” and “BiPlot2”
windows) defines an alternative composite set of components
and parameter estimates. That is:

U** = (\cos\theta_i) U + (\sin\theta_i) U*   (EQ3)

V** = (\cos\theta_i) V + (\sin\theta_i) V*   (EQ4)

where

•  U and V are the initial components and coefficients
(defined prior to interactive graphical modeling),

•  U* and V* are subjective components and coefficients
(determined through interactive graphical modeling),

•  U** and V** are the alternative, composite set of
components and coefficients (determined at the i-th step in a
guided tour rotation during interactive graphical
exploration), and

•  \theta_i is the cumulative rotation angle on the i-th step,
[0° ≤ \theta_i ≤ 90°].

Note  that  each  step  in  the  guided  tour  results  in  a  composite 
set  of  component  coefficients  different  from  the  initial  ones, 
but  less  extreme  than  those  determined  through  interactive 
graphical  modeling.  In  practice,  it  is  also  usually  necessary 
to  build  an  implicit  correction  factor  into  the  guided  tour 
rotation  in  order  to  maintain  component  orthogonality.  This 
detail  is  a  minor  technicality  that  does  not  substantively  alter 
the  nature  of  the  procedure. 
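
The interpolation step and the need for a correction can be sketched as follows. The paper does not specify the correction factor, so the QR-based re-orthonormalization shown here is a stand-in chosen for illustration; the function names, the 40 degree alteration, and the random starting components are likewise hypothetical.

```python
import numpy as np

def tour_step(M0, M1, theta_deg):
    """Trigonometric interpolation cos(theta) * M0 + sin(theta) * M1."""
    t = np.deg2rad(theta_deg)
    return np.cos(t) * M0 + np.sin(t) * M1

def reorthonormalize(U):
    """One possible correction (an assumption here): QR-based orthonormalization."""
    Q, Rq = np.linalg.qr(U)
    return Q * np.sign(np.diag(Rq))      # fix column signs so Q stays close to U

rng = np.random.default_rng(4)
U, _ = np.linalg.qr(rng.normal(size=(20, 2)))    # initial orthonormal components
a = np.deg2rad(40.0)
R = np.array([[np.cos(a), -np.sin(a)],
              [np.sin(a),  np.cos(a)]])
U_star = U @ R                                    # graphically altered components

U_mid = tour_step(U, U_star, 45.0)                # halfway through the tour
departure = np.abs(U_mid.T @ U_mid - np.eye(2)).max()
U_fixed = reorthonormalize(U_mid)                 # orthonormal again
```

At 0 degrees the step returns the initial components and at 90 degrees the altered ones; in between, `departure` is clearly nonzero, which is why some correction is needed to keep the composite components orthonormal.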

The system is set up so that the analyst may rotate from the
initial into the graphically altered components and back as
many times as needed to fully appreciate the effects of the
graphical alteration and to determine whether an intermediate
set of coefficients is more appropriate or not. Throughout
the interactive graphical exploration rotations, all of the fit
displays are continually updated in order to give the analyst a
sense of the objective quality of each intermediate set of
parameter estimates. At any point, the analyst has the option
of stopping the rotations and updating the initial components
and coefficients with those from the currently visible
composite set.

6.0  Conclusion


Statistical re-vision was presented as the framework within
which visual components analysis was organized, and it
provided the structure through which visual component
modeling interactions were carried out. Over a number of
iterations through a cycle of visualization and statistical
re-vision, it is anticipated that the analyst will generate visual
components that mitigate many of the effects of outlying or
influential observations in the component solution, that
visual components should more closely conform with the
analyst’s knowledge about his substantive research problem,
and that visual components analysis will yield a subjectively
more satisfying solution than that obtained from classical
algebraic component analyses.

FIGURE 1. A View of Interactive Graphical Modeling in the Component Model Space.

Abstract

We propose computationally feasible diagnostics within
the Bayesian paradigm. We focus on the detection of
influential observations and the assessment of the
sensitivity of the analysis to prior assumptions. We quantify
differences in the inferential conclusions that might
be drawn under modeling conditions that depart from
an assumed setting by estimating, via a Monte Carlo
approximation based on a single draw from the Gibbs
Sampler, the Kullback-Leibler divergence of the baseline
posterior distribution of the model parameters from the
alternative posterior distributions obtained by deleting
some observations or by altering the modeling
assumptions. We illustrate these ideas in the context of a normal
means hierarchical model.

1  Introduction 

In this article we propose computationally feasible
diagnostics within the Bayesian paradigm. We focus on two
issues: (a) detection of influential observations, and (b)
assessment of the sensitivity of the analysis to prior
assumptions. In both cases we wish to quantify differences
in the inferential conclusions that might be drawn
under modeling conditions that depart from an assumed
setting. We do so by measuring the Kullback-Leibler
divergence (Kullback 1959) of the baseline posterior
distribution of the model parameters from the alternative
posterior distributions obtained by deleting some
observations or by altering the modeling assumptions.

The difficulty with such an approach is that, in
principle, it entails reperforming the analysis for
each dataset/model considered. Within the Bayesian
framework this implies repeated evaluations of
multidimensional integrals to obtain the posterior
distributions of the model parameters. While closed form
analytic expressions for these posterior distributions
are available for simple models (DeGroot 1986; Berger
1985), for more realistic cases either numerical
quadrature methods (Smith et al. 1987), asymptotic
approximations (Walker 1969; Tierney and Kadane 1986), or
successive substitution sampling techniques (Gelfand and
Smith 1990; Tanner 1991) must be used. The majority
of these methods, with the exception of some asymptotic
approximations, require a large computational effort.

*The author would like to thank Thomas Santner, Prem Goel,

Similar problems do not occur, for example, when computing deletion
diagnostics such as Cook's distance in classical linear (or generalized
linear) models, because of the existence of exact (approximate) update
formulas for the required terms (Cook and Weisberg 1982). Approaches of
this type have also been explored for a limited number of Bayesian
problems (Carlin et al. 1992; Kass and Vaidyanathan 1992; McCulloch 1989;
Tierney et al. 1989).

The  methods  we  propose  in  this  article  have  similar 
goals.  We  assume  that  a  sample  from  the  baseline  pos¬ 
terior  distribution  of  the  model  parameters  can  be  gen¬ 
erated  through  the  Gibbs  Sampler  (Gelfand  and  Smith 
1990).  Expanding  on  ideas  of  Tanner  (1991,  p.  54), 
Gelfand  et  al.  (1992)  (who  consider  the  issue  of  model 
determination  from  a  predictive  viewpoint),  and  Smith 
and  Roberts  (1993),  we  estimate  the  Kullback-Leibler 
divergence  of  this  distribution  from  an  alternative  poste¬ 
rior  distribution  via  a  Monte  Carlo  approximation.  The 
various  terms  in  the  approximation  are  functions  of  the 
likelihood  ratios  of  the  two  distributions  evaluated  at  the 
different  points  in  the  sample. 

This  approach  has  the  desirable  property  that  the 
same  sample  from  the  baseline  posterior  distribution  can 
be  used  to  estimate  the  Kullback-Leibler  divergence  from 
several  alternative  posteriors.  Generation  of  one  sample 
via  the  Gibbs  Sampler  may  be  computationally  expen¬ 
sive  in  practical  situations.  By  circumventing  the  need 
to  redo  the  analysis  for  each  alternative  being  consid¬ 
ered,  the  proposed  approach  dramatically  reduces  the 
time  needed  to  identify  potentially  influential  observa¬ 
tions  and  to  probe  the  modeling  assumptions. 

2  The Gibbs Sampler and the Gibbs Stopper

The Gibbs Sampler is a successive substitution sampling scheme that
allows one to generate samples from the joint distribution of a set of
random variables having density $g(x) = g(x_1, \ldots, x_d)$ with respect
to a dominating measure $\lambda(x)$ (Gelfand and Smith 1990; Tanner
1991). In Bayesian statistical applications $g(x) = g(x \mid y)$ will
usually be the posterior probability density for the model parameters $x$
conditional on the observations $y$, and the samples will be used to
estimate functionals of $g(x)$. The algorithm generates a path
$\{x^{(j)}\}$ of a Markov chain whose invariant probability distribution
coincides with $g(x)$. Under mild regularity conditions the iterative
scheme is guaranteed to converge in the sense that, if $j$ is large
enough, $x^{(j)}$ can be regarded as a realization from $g(x)$ (Tierney
1991; Schervish and Carlin 1992).

Ritter and Tanner (1992) introduce a diagnostic criterion, called the
Gibbs Stopper, to assess convergence in practical applications. Denote by
$g_j(x)$ the density of the distribution of the chain at the $j$-th stage
of the iterative procedure. If convergence has been attained, so that
$g_j(x)$ is “close” to $g(x)$, then the ratio $g(x)/g_j(x)$ should be
close to one over the whole range of possible $x$ values. In general, the
target density $g(x)$ will only be known up to a renormalization constant
$C$, i.e. only $g(x)/C$ can be evaluated, and $g_j(x)$ will have to be
estimated. Ritter and Tanner (1992) propose an estimate $\hat g_j(x)$
given by a Monte Carlo sum based on the two final sets of draws
$x_m^{(j-1)}$ and $x_m^{(j)}$, $m = 1, \ldots, M$, from $M$ independent
paths of the Gibbs Sampler carried out to depth $j$.

Upon convergence the ratios

$$\frac{g(x_m^{(j)})}{C\,\hat g_j(x_m^{(j)})}, \qquad m = 1, \ldots, M, \tag{1}$$

should be concentrated around a constant value. The Gibbs Stopper amounts
to monitoring the ratios in Equation (1) and halting the algorithm once
visual inspection of their histograms and evaluation of some functional
of their distribution (e.g. their standard deviation) indicate that they
have stabilized around a constant.

We will refer to the ratios in Equation (1) as GS-weights. Note that, for
a fixed number of cycles $j$, by associating a probability $w_{g,m}$,
proportional to the $m$-th ratio in Equation (1), to each of the points
$x_m^{(j)}$, $m = 1, \ldots, M$, one can regard them as a sample from
$g(x)$ instead of $g_j(x)$ (Geweke 1989). In the sequel, when referring
to a sample $x_m$ from $g(x)$ obtained through $M$ independent replicates
of $j$ cycles of the Gibbs Sampler, we will more precisely mean a sample
from $g_j(x)$ reweighted according to $w_{g,m}$.
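
The reweighting step can be sketched as follows. This is an illustration rather than the paper's code: the target is a standard normal known only up to its constant, and a N(0, 2^2) density stands in for the estimate of the density of the draws. Both choices, and the names `target_unnormalized`, `proposal_density` and `gs_weights`, are assumptions made for the example.

```python
import math
import random

def target_unnormalized(x):
    # Standard normal density without its normalizing constant.
    return math.exp(-0.5 * x * x)

def proposal_density(x, sd=2.0):
    # Stand-in for the estimated density of the draws (an assumption).
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def gs_weights(draws):
    """Ratios target/proposal at each draw, renormalized to sum to one."""
    ratios = [target_unnormalized(x) / proposal_density(x) for x in draws]
    total = sum(ratios)
    return [r / total for r in ratios]

rng = random.Random(0)
draws = [rng.gauss(0.0, 2.0) for _ in range(20000)]
weights = gs_weights(draws)

# Weighted estimate of E[x^2] under the target, whose true value is 1.
second_moment = sum(w * x * x for w, x in zip(weights, draws))
print(second_moment)
```

After reweighting, averages over the draws behave like averages under the target, even though the draws themselves came from the wider distribution.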


3  Monte Carlo Estimation of the Kullback-Leibler Divergence and Bayesian Analysis

Let two distributions have densities $f$ and $g$ with respect to a common
dominating measure $\lambda$. The Kullback-Leibler divergence of $g$ from
$f$ is defined as

$$\mathcal{K}(f, g) = \int \log\!\left(\frac{f(x)}{g(x)}\right) f(x)\, d\lambda(x).$$

The  use  of  the  Kullback-Leibler  divergence  to  evaluate 
discrepancies  between  distributions  in  an  attempt  to  as¬ 
sess  case  deletion  influence  and  sensitivity  to  prior  as¬ 
sumptions  is  well  documented  in  the  statistical  literature 
(Johnson  and  Geisser  1983;  McCulloch  1989;  Gelfand  et 
al. 1992). Observe that, not being symmetric in its arguments,
the Kullback-Leibler divergence is not a distance.
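
The lack of symmetry is easy to see numerically. The sketch below is not from the paper: it uses the standard closed form of the divergence between two univariate normal densities, and the helper name `kl_normal` and the parameter values are illustrative assumptions.

```python
import math

def kl_normal(mu_f, sd_f, mu_g, sd_g):
    """K(f, g) = integral of log(f/g) f, for f = N(mu_f, sd_f^2), g = N(mu_g, sd_g^2)."""
    return (math.log(sd_g / sd_f)
            + (sd_f ** 2 + (mu_f - mu_g) ** 2) / (2.0 * sd_g ** 2)
            - 0.5)

# Illustrative values only: f = N(0, 1), g = N(1, 4).
k_fg = kl_normal(0.0, 1.0, 1.0, 2.0)   # divergence of g from f
k_gf = kl_normal(1.0, 2.0, 0.0, 1.0)   # divergence of f from g
print(k_fg, k_gf)                       # the two values differ
```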

Throughout the section we will denote by $p(x) = p(x \mid y)$ the
posterior density for the model parameters $x = (x_1, \ldots, x_d)$
conditional on the set of observations $y = (y_1, \ldots, y_n)$. We first
discuss a comprehensive screening method for identifying influential
observations within a Bayesian framework. Let $I$ be a subset of the
integers 1 through $n$ and let
$p_{\setminus I}(x) = p_{\setminus I}(x \mid y_{\setminus I})$ be the
posterior density for the model parameters $x$ conditional on the reduced
set of observations $y_{\setminus I} = \{y_i : i \notin I\}$. Denote by
$q(x) = q(x, y)$ and $q_{\setminus I}(x) = q(x, y_{\setminus I})$ the
joint densities of $(x, y)$ and $(x, y_{\setminus I})$ respectively. Then
$p(x) = q(x)/C$ and $p_{\setminus I}(x) = q_{\setminus I}(x)/C_{\setminus I}$,
with $C = \int q(x)\, d\lambda(x)$ and
$C_{\setminus I} = \int q_{\setminus I}(x)\, d\lambda(x)$.

Suppose that a sample from $p(x)$ obtained through the Gibbs Sampler is
available and that we wish to determine the effect that the presence or
absence of individual observations has on our inferential conclusions by
means of comparison between the posterior distribution $p$, conditional
on the entire set of observations $y$, and the $n$ posterior
distributions $p_{\setminus I}$, $I = \{i\}$, conditional on the $n$
reduced subsets of observations obtained by deleting observation $y_i$ in
turn.

We propose to employ the available sample from $p(x)$ to compute Monte
Carlo estimates of the $n$ values of the Kullback-Leibler divergence
$\mathcal{K}(p_{\setminus I}, p)$ of $p$ from each of the
$p_{\setminus I}$. This is done in an attempt to obtain a measure of the
effect that inclusion of the $i$-th observation would have on our
inferences. Large divergence values would suggest that the observation
has small likelihood under the assumed model and was possibly generated
by a stochastic mechanism that differs from the one generating the
remainder of the dataset.

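Suppose then that, having run $M$ independent Gibbs paths to depth $j$
for the full model, we have draws $x_m$ from $p(x)$ together with their
associated GS-weights $w_{p,m}$, for $m = 1, \ldots, M$. We show in
Peruggia (1994) that a Monte Carlo estimate of the Kullback-Leibler
divergence $\mathcal{K}(p_{\setminus I}, p)$ is given by:

$$\mathcal{K}_{\setminus I} = \sum_{m=1}^{M} \frac{w_{p,m}\, r_{\setminus I,m}}{\sum_{k=1}^{M} w_{p,k}\, r_{\setminus I,k}} \log r_{\setminus I,m} - \log\left(\sum_{m=1}^{M} w_{p,m}\, r_{\setminus I,m}\right), \tag{2}$$

where

$$r_{\setminus I,m} = \frac{q_{\setminus I}(x_m)}{q(x_m)}, \qquad m = 1, \ldots, M. \tag{3}$$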
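
The paper defers the derivation to Peruggia (1994). One way to see where an estimator built from the ratios in Equation (3) comes from is the following sketch, which is not necessarily the author's argument. Here $r(x) = q_{\setminus I}(x)/q(x)$ is shorthand introduced for this note, and expectations under $p$ are approximated by weighted sums over the draws.

```latex
\begin{align*}
\frac{p_{\setminus I}(x)}{p(x)}
  &= \frac{q_{\setminus I}(x)/C_{\setminus I}}{q(x)/C}
   = \frac{r(x)}{C_{\setminus I}/C}, \\
\frac{C_{\setminus I}}{C}
  &= \int r(x)\, p(x)\, d\lambda(x)
   \approx \sum_{m=1}^{M} w_{p,m}\, r(x_m), \\
\mathcal{K}(p_{\setminus I}, p)
  &= \int \log\!\left(\frac{p_{\setminus I}(x)}{p(x)}\right)
     \frac{p_{\setminus I}(x)}{p(x)}\, p(x)\, d\lambda(x)
   = \frac{\int r \log r \; p\, d\lambda}{\int r\, p\, d\lambda}
     - \log \int r\, p\, d\lambda .
\end{align*}
```

Replacing each integral against $p$ by the corresponding weighted sum over the draws gives a difference of two terms: a weighted average of the log ratios, minus the logarithm of the weighted mean ratio.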


While the numerical values of the Kullback-Leibler divergence can be used
to express a quantitative judgment, the ratios in Equation (3) can be
used to make a graphical assessment of influence. In fact, if $p(x)$ is
close to $p_{\setminus I}(x)$, then the distribution of the $M$ ratios
should be concentrated around a constant value, which implies that the
renormalized ratios

$$W_m = \frac{q_{\setminus I}(x_m)/q(x_m)}{\sum_{k=1}^{M} q_{\setminus I}(x_k)/q(x_k)}, \qquad m = 1, \ldots, M, \tag{4}$$

should be concentrated around $1/M$. Examination of the box-plot of the
set of weights in Equation (4), preferably after having applied a
logarithmic transformation, can therefore be used to make a judgment.
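
A minimal sketch of how such an estimate and the renormalized ratios might be computed from a set of log ratios is given below. It is a self-normalized importance-sampling estimate of the kind Equation (2) describes, not the author's implementation. The function names are assumptions, and the toy check uses normal quantiles in place of Gibbs draws so that the answer (0.5) is known in advance.

```python
import math
from statistics import NormalDist

def kl_estimate(log_ratios, weights=None):
    """Monte Carlo estimate of K(p_alt, p) from draws x_m of the baseline p.

    log_ratios[m] = log q_alt(x_m) - log q(x_m); weights are the GS-weights
    (equal weights 1/M if omitted).
    """
    m_draws = len(log_ratios)
    if weights is None:
        weights = [1.0 / m_draws] * m_draws
    shift = max(log_ratios)                      # guards against overflow
    terms = [w * math.exp(lr - shift) for w, lr in zip(weights, log_ratios)]
    total = sum(terms)
    tilted = [t / total for t in terms]          # w_m r_m / sum_k w_k r_k
    first = sum(t * lr for t, lr in zip(tilted, log_ratios))
    return first - (shift + math.log(total))     # minus log sum_m w_m r_m

def renormalized_ratios(log_ratios):
    """Ratios rescaled to sum to one, for the box-plot check."""
    shift = max(log_ratios)
    r = [math.exp(lr - shift) for lr in log_ratios]
    total = sum(r)
    return [v / total for v in r]

# Toy check: baseline p = N(0, 1), alternative N(1, 1), divergence 0.5.
M = 20000
draws = [NormalDist().inv_cdf((m + 0.5) / M) for m in range(M)]
log_r = [x - 0.5 for x in draws]                 # log density ratio N(1,1)/N(0,1)
estimate = kl_estimate(log_r)
print(estimate)
```

Working on the log scale and subtracting the largest log ratio before exponentiating keeps the computation stable when a few ratios dominate, which is exactly the situation an influential observation produces.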

These ideas generalize immediately to the case in which one is concerned
with the influence that some aspects of the modeling assumptions (in
particular prior specification) have on the inferential process
(robustness and sensitivity analysis). Suppose a “baseline” specification
of the model yields the joint density $q(x, y) = q(x)$ for the parameters
$x$ and the data $y$. As before, we can run the Gibbs Sampler for this
model and obtain $M$ independent draws $x_m$ from the posterior
distribution $p(x)$ with their associated GS-weights $w_{p,m}$. Assume
further that the modification of some aspects of the model leads to the
alternative joint density $q_A(x, y) = q_A(x)$, with corresponding
posterior density $p_A(x \mid y) = p_A(x)$ for the same parameter vector
$x$.

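Then, once the set of ratios

$$\frac{q_A(x_m)}{q(x_m)}, \qquad m = 1, \ldots, M, \tag{5}$$

has been constructed, the analysis can proceed as before. In particular,
we can examine the box-plot of the logarithms of the renormalized ratios
to determine how concentrated they are, and we can estimate
$\mathcal{K}(p_A, p)$ by:

$$\mathcal{K}_A = \sum_{m=1}^{M} \frac{w_{p,m}\, q_A(x_m)/q(x_m)}{\sum_{k=1}^{M} w_{p,k}\, q_A(x_k)/q(x_k)} \log\frac{q_A(x_m)}{q(x_m)} - \log\left(\sum_{m=1}^{M} w_{p,m}\, \frac{q_A(x_m)}{q(x_m)}\right). \tag{6}$$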


4  The  Normal  Means  Model 

We illustrate these ideas with an example. Consider the hierarchical
Normal Means Model (Gelfand and Smith 1990). We observe $L_k$ data points
from the $k$-th of $K$ normal populations, i.e.
$y_{k,l} \sim N(\theta_k, \sigma_k^2)$, for $k = 1, \ldots, K$ and
$l = 1, \ldots, L_k$. Conditional on the parameter values $\theta_k$ and
$\sigma_k^2$, the observations are assumed to be independent within and
between groups. Further, we assume the group means and variances to be
independent with $\theta_k \sim N(\mu, \tau^2)$ and
$\sigma_k^2 \sim IG(a_1, b_1)$ ($a_1$ and $b_1$ known). Finally, we
assume $\mu$ and $\tau^2$ to be independent with
$\mu \sim N(\mu_0, \sigma_0^2)$ and $\tau^2 \sim IG(a_2, b_2)$ ($\mu_0$,
$\sigma_0^2$, $a_2$ and $b_2$ known). In the notation of the previous
section, $x = (\{\theta_k\}, \{\sigma_k^2\}, \mu, \tau^2)$, a
$(2 \times K + 2)$-dimensional parameter vector, and $y = \{y_{k,l}\}$.

We ran our experiment using simulated observations. The data $y$ were
generated from two independent normal populations ($K = 2$): the first
sample, of size $L_1 = 10$, from a $N(0, 1)$ distribution, and the
second, of size $L_2 = 8$, from a $N(0.5, 1)$ distribution. We completed
the specification of the prior distributions by setting $\mu_0 = 0$,
$\sigma_0 = 1$, $a_1 = a_2 = 4$, and $b_1 = b_2 = 0.333$. These choices
imply that both $\sigma_k^2$ and $\tau^2$ have mean 1 and variance 0.5.
Based on these assumptions we performed the following influence and
sensitivity analyses.
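
The stated prior moments can be checked directly. The sketch below assumes the inverse gamma parameterization with density proportional to $x^{-(a+1)} \exp\{-1/(bx)\}$, which is an assumption on our part, adopted because it reproduces the mean and variance quoted above. The function name is illustrative.

```python
def inverse_gamma_moments(a, b):
    """Mean and variance of IG(a, b) with density proportional to
    x**-(a + 1) * exp(-1 / (b * x)); a finite variance needs a > 2."""
    mean = 1.0 / (b * (a - 1.0))
    variance = mean ** 2 / (a - 2.0)
    return mean, variance

mean, variance = inverse_gamma_moments(4.0, 0.333)
print(round(mean, 3), round(variance, 3))
```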


4.1  Influence 

In order to illustrate how our method can be applied to detect
influential observations we artificially introduced a spurious data
point. Specifically, we shifted $y_{2,1}$ by 6 standard deviations to the
left of its observed value of $-0.067$, setting it equal to $-6.067$. We
then ran $M = 100$ independent Gibbs Sampler paths to depth $j = 200$,
thus obtaining 100 draws $x_m$ and associated GS-weights $w_{p,m}$ from
the posterior distribution $p(x)$ conditional on the 18 observations, and
assessed convergence using the Gibbs Stopper criterion of Section 2.

We then implemented the leave-one-out strategy for influence detection
outlined in Section 3. In this setting, if we take $I = \{(k, l)\}$ (i.e.
if we consider removing the $l$-th observation in the $k$-th group from
the dataset), we obtain the following functional form for the ratios in
Equation (3):

$$\frac{q_{\setminus I}(x_m)}{q(x_m)} = \frac{1}{\varphi(y_{k,l} \mid \theta_{k,m}, \sigma_{k,m}^2)}, \qquad m = 1, \ldots, M, \tag{7}$$

where $\varphi(\cdot \mid \theta, \sigma^2)$ denotes the density function
of a normal random variable with mean $\theta$ and variance $\sigma^2$,
and $\theta_{k,m}$ and $\sigma_{k,m}^2$ are the values of $\theta_k$ and
$\sigma_k^2$ in the $m$-th draw. Adjacent box-plots of the 18 sets
(corresponding to all $I = \{(k, l)\}$) of 100 ratios defined in Equation
(7) (after renormalization and transformation on the logarithmic scale)
are displayed in Figure 1.

Figure 1: Box-Plots of the Logarithm of the Leave-One-Out Renormalized Ratios

Figure 2: Sensitivity of Posterior to Prior Specification of $(\mu_0, \sigma_0^2)$
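
The leave-one-out ratios only require evaluating one normal density per draw, since deleting an observation removes its likelihood factor from the joint density. The sketch below illustrates this with five made-up posterior draws. The draws, the function names and the comparison are assumptions for illustration, not the paper's data or code.

```python
import math

def log_normal_pdf(y, mean, var):
    return -0.5 * (math.log(2.0 * math.pi * var) + (y - mean) ** 2 / var)

def leave_one_out_log_ratios(y_kl, theta_k_draws, sigma2_k_draws):
    """log[q_without(x_m) / q(x_m)] = -log phi(y_kl | theta_k, sigma_k^2) per draw."""
    return [-log_normal_pdf(y_kl, t, s2)
            for t, s2 in zip(theta_k_draws, sigma2_k_draws)]

def renormalize(log_ratios):
    shift = max(log_ratios)                      # stabilizes the exponentials
    r = [math.exp(lr - shift) for lr in log_ratios]
    total = sum(r)
    return [v / total for v in r]

# Hypothetical draws of (theta_k, sigma_k^2) for one group.
theta_draws = [-0.4, -0.1, 0.0, 0.2, 0.5]
sigma2_draws = [0.8, 1.0, 1.2, 0.9, 1.1]

w_outlier = renormalize(leave_one_out_log_ratios(-6.067, theta_draws, sigma2_draws))
w_typical = renormalize(leave_one_out_log_ratios(-0.067, theta_draws, sigma2_draws))
print(max(w_outlier), max(w_typical))
```

For the outlying value a single draw carries most of the renormalized weight, whereas for the typical value the weights stay close to equal. This is the pattern the box-plots are meant to reveal.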

As expected, the box-plot of the set of ratios corresponding to
$I = \{(2, 1)\}$ appears strikingly different from the others. In
particular, in this case, there is a renormalized ratio as large as
0.465, while the overall maximum ratio over the remaining 17 sets belongs
to the set corresponding to $I = \{(1, 4)\}$ and equals 0.132. This
suggests that suppressing observation $y_{(2,1)}$ from the data will
exert a strong influence. More precisely, the posterior distributions for
the parameters $x$ given all 18 observations and given all 18
observations but the $(2, 1)$-th will differ significantly. Visual
inspection of the box-plots indicates that observations $y_{(1,4)}$ and
$y_{(1,3)}$ may also be considered mildly influential.

Next we used Equation (2) to compute $\mathcal{K}_{\setminus I}$, the
estimated Kullback-Leibler divergence of $p$ from $p_{\setminus I}$, for
all $I = \{(k, l)\}$. While the great majority of the estimated values
are of the order of 10”^, $\mathcal{K}_{\setminus\{(2,1)\}} \approx 2.8$,
in strong agreement with the conclusions we had already drawn from visual
inspection of the box-plots. Also in agreement with those conclusions is
the fact that $\mathcal{K}_{\setminus\{(1,3)\}}$ and
$\mathcal{K}_{\setminus\{(1,4)\}}$ are of the order of 10”^. Peruggia
(1994) contains a detailed analysis offering evidence of the considerable
location shift and reduced variability in the marginal posterior density
of $\theta_2$ induced by the deletion of the outlying observation
$y_{(2,1)} = -6.067$.

4.2  Sensitivity 

Now we illustrate how the same approach can be employed to perform a
sensitivity analysis. For the original dataset, we probed the effect of
varying the prior specification of the mean $\mu_0$ for the parameter
$\mu$ as follows. Let $q(x)$ denote the joint density of $(x, y)$
corresponding to the “baseline” specification of $\mu_0 = 0$. We
considered 101 equally-spaced, alternative values $\mu_0 = \mu_A$ in the
interval $[-5, 5]$. Each such value yielded a corresponding joint density
$q_A(x)$ for $(x, y)$. We used the Gibbs Sampler to generate $M = 100$
independent observations $x_m$ and corresponding GS-weights $w_{p,m}$
from the posterior distribution having density $p(x) = q(x)/C$, where
$C = \int q(x)\,dx$. We then computed the 101 sets of ratios
corresponding to each alternative $\mu_A$ according to Equation (5). More
explicitly, with $\mu_m$ denoting the $M$ values of $\mu$ generated via
the Gibbs Sampler, we computed

$$\frac{q_A(x_m)}{q(x_m)}, \qquad m = 1, \ldots, M,$$

and from these we derived, according to Equation (6), the estimated
Kullback-Leibler divergence $\mathcal{K}_A$ of $p(x)$ from
$p_A(x) = q_A(x)/C_A$, where $C_A = \int q_A(x)\,dx$, for the 101
alternative values of $\mu_A$ being considered.
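
Because only the normal prior on $\mu$ changes between the baseline and the alternative, every other factor of the joint density cancels in the ratio, which then depends on each draw only through its $\mu$ component. The sketch below works under that observation. The posterior draws of $\mu$ are hypothetical (quantiles of a N(0.2, 0.5^2) distribution standing in for Gibbs output), the weights are taken as equal, and the function names are assumptions.

```python
import math
from statistics import NormalDist

def log_prior_ratio(mu_m, mu_a, sigma0_sq=1.0, mu0=0.0):
    """log[q_A(x_m) / q(x_m)] when only the prior mean of mu changes."""
    return ((mu_m - mu0) ** 2 - (mu_m - mu_a) ** 2) / (2.0 * sigma0_sq)

def kl_from_log_ratios(log_ratios):
    """Equal-weight version of the divergence estimate."""
    shift = max(log_ratios)
    r = [math.exp(v - shift) for v in log_ratios]
    total = sum(r)
    first = sum(ri / total * v for ri, v in zip(r, log_ratios))
    return first - (shift + math.log(total / len(r)))

# Hypothetical posterior draws of mu (not the paper's Gibbs output).
M = 5000
mu_draws = [NormalDist(0.2, 0.5).inv_cdf((m + 0.5) / M) for m in range(M)]

# 101 equally spaced alternative prior means in [-5, 5].
grid = [-5 + i / 10 for i in range(101)]
curve = [kl_from_log_ratios([log_prior_ratio(mu, mu_a) for mu in mu_draws])
         for mu_a in grid]
print(curve[50], curve[70])   # values at mu_A = 0 and mu_A = 2
```

The same draws are reused for every alternative on the grid, which is the computational saving the method relies on. The estimate becomes less reliable for alternatives far from the baseline, where a handful of draws dominate the sums.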

The plot of $\mathcal{K}_A$ versus $\mu_A$ was fairly symmetric around
$\mu_0 = 0$, with a rate of increase only slightly higher for positive
values of $\mu_0 = \mu_A$. Although the actual numerical values of the
Kullback-Leibler divergence are difficult to interpret directly, it
appeared that a prior specification of $\mu_0 = 0$ when the “true” value
of $\mu_0$ is some other value $\mu_A$ in the interval $[-2, 2]$ should
not have an overwhelming impact on the resulting posterior distribution
for the model parameters.

In a similar manner, and with little additional computational burden, it
is possible to assess the effect of varying more than one prior parameter
at a time. Figure 2 illustrates the results we obtained by altering
simultaneously the values of $\mu_0$ and $\sigma_0^2$ in the
specification of the prior distribution of $\mu$. At each point
$(\mu_0 = \mu_A, \sigma_0^2 = \sigma_A^2)$, the figure displays the
Kullback-Leibler divergence of the posterior distribution for $x$ arising
from the original specification $(\mu_0 = 0, \sigma_0^2 = 1)$ from the
one arising from the alternative specifications
$(\mu_0 = \mu_A, \sigma_0^2 = \sigma_A^2)$. Darker shades of gray
correspond to larger divergence values, as indicated by the gray-scale
bar on the right hand side of the figure. The display indicates clearly
that a shift in the prior specification of $\mu_0$ away from 0 has
stronger repercussions on the inferential process for smaller values of
$\sigma_A^2$.

It is intrinsically difficult to evaluate the numerical values of the
Kullback-Leibler divergence on an absolute scale (see for instance
McCulloch 1989). We looked at this problem from the point of view of
equivalency between model specification and the presence of influential
observations. Denote by $p$ the posterior distribution for $x$
conditional on all 18 observations $y$ when $y_{(2,1)} = -6.067$, and by
$p_{\setminus\{(2,1)\}}$ the posterior for $x$ arising from the same
model after removing $y_{(2,1)}$ from the analysis. We estimated before
that $\mathcal{K}(p_{\setminus\{(2,1)\}}, p) = 2.817$. Observe that both
$p$ and $p_{\setminus\{(2,1)\}}$ are based on a prior specification of
the parameter value $\mu_0 = 0$.

By employing the proposed Monte Carlo technique based on a random sample
from $p_{\setminus\{(2,1)\}}$, we estimated that an alternative
specification $\mu_A \approx -4.7$ of $\mu_0$ would yield a posterior
distribution $p_{\setminus\{(2,1)\},\mu_A}$ for which
$\mathcal{K}(p_{\setminus\{(2,1)\}}, p_{\setminus\{(2,1)\},\mu_A})$ is
also approximately equal to 2.8. In other words, introducing the aberrant
observation $y_{(2,1)} = -6.067$ into the analysis has the same effect on
the posterior distribution for $x$ (in terms of Kullback-Leibler
divergence) as moving the prior specification of $\mu_0$ from 0 to
$-4.7$. Thus, if we consider a shift from 0 to $-4.7$ in our prior
beliefs about $\mu_0$ to be important, we should also attach the same
degree of relevance to the presence of the outlying observation
$y_{(2,1)}$ in our dataset.

It  is  important  in  practical  applications  to  be  able  to 
assess  the  Monte  Carlo  variance  of  the  proposed  esti¬ 
mates  of  the  Kullback-Leibler  divergence  between  two 
distributions.  In  Peruggia  (1994)  we  discuss  this  issue 
and  illustrate  it  within  the  context  of  the  normal  means 
model  example. 

Abstract

We  generalize  the  linear  mixed-effects  model  introduced 
by  Laird  and  Ware  (1982)  to  include  random  change- 
points,  in  a  manner  similar  to  Stephens  (1994).  We  use 
a  fully  Bayesian  hierarchical  model  in  which  the  para¬ 
metric  forms  are  known  between  the  changepoints  and 
we  estimate  the  changepoints  and  model  parameters  us¬ 
ing  Gibbs  sampling.  These  techniques  are  applied  to 
investigate  prostate  specific  antigen  (PSA)  as  a  diagnos¬ 
tic  indicator  for  prostate  cancer  by  modeling  longitudi¬ 
nal  PSA  measurements  for  which  the  changepoint  is  the 
onset  of  cancer.  We  are  most  concerned  with  the  goal 
of  accurate  early  detection.  Diagnostic  rules  previously 
proposed  in  the  medical  literature  are  compared  with 
measures  based  on  the  posterior  probability  of  disease 
onset. 


1  Introduction

Laird and Ware (1982) introduced a family of mixed-effects models which
capture the serial correlation found in longitudinal data. We are
interested in modeling longitudinal data where the underlying process
changes at a random point in time. We extend the mixed-effects model to
include this continuous random changepoint and use the Gibbs sampler to
estimate model parameters including the changepoint.

There is a great deal of literature on identifying when a process has
changed and estimating the changepoint. Page (1955) used non-parametric
methods to test the hypothesis that all observations are from the same
distribution. Hinkley (1969, 1970) used maximum likelihood estimation to
identify a shift in process mean and the intersection of a two-phase
regression. Smith (1975) presented a Bayesian approach to estimating
changepoints for normal and binomial distributions along with an informal
sequential procedure. Carlin et al. (1992) gave a fully Bayesian
hierarchical analysis of changepoint problems, including the use of the
Gibbs sampler to solve for the posterior distributions of model
parameters. Stephens (1994) looked at continuously distributed
changepoints and multiple changepoint identification from a retrospective
point of view.

We describe a mixed-effects model with linear growth before and after the
changepoint. We then apply the model to a simulated data set based on the
longitudinal PSA measurements found in the study by Carter et al. (1992).
We perform a prospective sequential analysis to see how quickly this
method identifies a changepoint after it occurs and compare the results
with other proposed diagnostic rules using receiver operator
characteristic (ROC) curves.

2  Hierarchical model


The mixed-effects model for linear growth before and after the
changepoint, $t_i$, can be written as

$$y_{ij} = a_{0i} + a_i x_{ij} + b_i (x_{ij} - t_i)^{+} + \epsilon_{ij} \tag{1}$$

where $y_{ij}$ is the measured value for subject $i$ at observation $j$,
$x_{ij}$ is the time of observation $j$ for subject $i$, and $(u)^{+}$
denotes $\max(u, 0)$. The index $i$ takes values $1, \ldots, N$ and $j$
takes values $1, \ldots, n_i$ when there are $N$ subjects in the study
and the $i$th has $n_i$ observations. The complete model assumes the
following distributions.


(2) 


Wishart((pV^)  ^ ,  p) 
N(pp,or|) 
1

Gamma(A5,  rt) 

N(r,(T2) 

r  ^ 

1 

o  ^ 

Gamma(At,  n) 

N(0,.t2J 

1 

< 

Gamma(A£,  r^) 

The  prior  distributions 

for(“;),E<.,jd, 

are  assumed  known. 

The Gibbs sampler, as described in Gelfand and Smith (1990), is used to
solve for the posterior distributions of the model parameters. The
procedure is similar to that in Lange et al. (1992), but with a
continuous changepoint as described in Stephens (1994). The complete
conditional distributions for each parameter, with the exception of the
$\{t_i\}$, are standard parametric distributions and can be easily
sampled. Although the form of the complete conditional distribution for
$t_i$ changes at each observation point, the form of the distribution
between observation points is known. Hence the $\{t_i\}$ can be generated
in a two step procedure which first generates an interval and then
generates a point within that interval. Thus it is straightforward to
generate from all the complete conditional distributions. This procedure
leads to estimates of the subject specific parameters, including the
$\{t_i\}$, based on posterior distributions. The hierarchical approach
permits the “borrowing of strength” from the population to estimate the
individual parameters while accounting for the within-subject serial
correlation.
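
The two step procedure for the changepoint can be sketched generically. The paper exploits the known form of the conditional density between observation points, which is not reproduced in the text. The sketch below swaps in a numerical approximation on a fine grid, so it is an illustration of the interval-then-point idea rather than the authors' sampler. The function names and the step-shaped test density are assumptions.

```python
import random

def interval_masses(density, breakpoints, n_grid=200):
    """Normalized probability of each interval between consecutive breakpoints."""
    masses = []
    for lo, hi in zip(breakpoints[:-1], breakpoints[1:]):
        h = (hi - lo) / n_grid
        # Midpoint rule: the density is never evaluated at a breakpoint,
        # where its functional form may change.
        masses.append(h * sum(density(lo + (i + 0.5) * h) for i in range(n_grid)))
    total = sum(masses)
    return [m / total for m in masses]

def sample_changepoint(density, breakpoints, rng, n_grid=200):
    probs = interval_masses(density, breakpoints, n_grid)
    # Step 1: choose an interval with probability equal to its mass.
    k = rng.choices(range(len(probs)), weights=probs)[0]
    lo, hi = breakpoints[k], breakpoints[k + 1]
    h = (hi - lo) / n_grid
    mids = [lo + (i + 0.5) * h for i in range(n_grid)]
    # Step 2: choose a point within it (piecewise-constant approximation).
    cell = rng.choices(range(n_grid), weights=[density(m) for m in mids])[0]
    return lo + (cell + rng.random()) * h

def step_density(t):
    # Unnormalized toy density whose form changes at t = 1.
    return 1.0 if t < 1.0 else 3.0

rng = random.Random(0)
probs = interval_masses(step_density, [0.0, 1.0, 2.0])
samples = [sample_changepoint(step_density, [0.0, 1.0, 2.0], rng)
           for _ in range(2000)]
```

In a real sampler the breakpoints would be the subject's observation times and the density would be the complete conditional for $t_i$ given the current values of the other parameters.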

3  Application:  Prostate  disease 
and  PSA 

Prostate  cancer  is  the  second  leading  cause  of  cancer- 
related  deaths  among  American  males  (Pearson  et  al. 
1994).  Garnick  (1994)  discusses  the  prevalence  of 
prostate  cancer  and  the  dilemmas  associated  with  diag¬ 
nosis  and  treatment.  There  has  been  much  controversy 
over  the  benefits  and  the  possible  dangers  of  screening 
for  prostate  cancer.  In  this  application  we  do  not  address 
the  larger  question  of  whether  screening  should  be  per¬ 
formed,  but  look  at  a  methodology  that  could  be  used 
to  evaluate  diagnostic  rules  used  in  screening. 

Prostate  specific  antigen  (PSA)  is  a  glycoprotein  pro¬ 
duced  by  the  prostate  gland.  The  level  of  PSA  found  by  a 
blood  test  increases  with  the  volume  of  the  prostate.  The 


work  of  Catalona  et  al.  (1991,  1993)  supported  the  use¬ 
fulness  of  PSA  levels  as  a  diagnostic  marker  for  prostate 
cancer.  Gerber  (1991)  discussed  the  value  of  screen¬ 
ing  along  with  a  review  of  current  screening  methods. 
Oesterling et al. (1993) performed a prospective study to understand the
link between PSA and age. They concluded that PSA increases gradually
with age in normal men and suggested normal ranges of PSA for different
age groups.

Carter  et  al.  (1992),  and  Pearson  et  al.  (1991,  1994) 
looked  at  serial  PSA  readings  on  men  over  a  period  of  7 
to  25  years.  They  used  a  mixed-effects  regression  model 
to  test  whether  the  changes  in  PSA  readings  were  dif¬ 
ferent  in  men  with  and  without  prostate  disease.  Model 
parameters  were  estimated  using  a  Newton-Raphson  re¬ 
stricted  maximum  likelihood  method.  Carter  et  al. 
(1992)  observed  that  PSA  increases  only  very  slowly  with 
age  before  the  onset  of  cancer  and  then  increases  more 
rapidly  when  cancer  is  present.  As  an  approximating 
model,  we  will  assume  that  it  is  the  square  root  of  the 
PSA  level  that  follows  the  linear  changepoint  model  (1). 

Our  work  is  motivated  by  longitudinal  readings  from 
the  Nutritional  Prevention  of  Cancer  Trial  (Abu-Libdeh 
et  al.  1990,  Clark  et  al.  1991).  Over  the  course  of  the 
trial, participants have been giving blood at approximately
six-month intervals. Of these participants, some have
developed  prostate  cancer.  The  principal  investigator, 
Dr.  L.  C.  Clark,  plans  to  determine  the  PSA  levels  of 
the  frozen  blood  samples  from  subjects  with  and  without 
prostate  cancer  to  further  study  the  relationship  between 
PSA  levels  and  prostate  disease. 

We present results for an analysis based on simulated data. These data
represent square root PSA measurements taken annually on 60 men over a 30
year period, with initial ages ranging from 28.4 to 89.6 years. First,
random intercepts $\{a_{0i}\}$ and initial slopes $\{a_i\}$ were
generated. Then, for 30 of these subjects (“cases”) we simulated
age-at-onset times by generating changepoints $\{t_i\}$ from a normal
distribution with a mean of 70 years and a standard deviation of 10
years. For those subjects with changepoints, post-change slopes
$\{b_i\}$ were generated. Finally, subject-specific and measurement
errors were included to yield simulated square root readings
$\{y_{ij}\}$. The parameters used for the simulation were derived from
the longitudinal data presented by Carter et al. (1992).
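
The simulation steps above can be sketched as follows. Only the number of subjects, the number of cases, the range of initial ages and the N(70, 10^2) distribution of changepoints come from the text. Every other numerical value below is a hypothetical placeholder, initial ages are drawn uniformly for want of better information, and a single error term stands in for the subject-specific and measurement errors.

```python
import random

def simulate_subject(rng, first_age, is_case, n_obs=30):
    """Yearly square-root PSA readings from the linear changepoint model."""
    a0 = rng.gauss(0.6, 0.1)                          # intercept (hypothetical)
    a = rng.gauss(0.01, 0.002)                        # pre-change slope (hypothetical)
    t = rng.gauss(70.0, 10.0) if is_case else None    # age at onset, from the text
    b = rng.gauss(0.15, 0.03) if is_case else 0.0     # post-change slope (hypothetical)
    ages = [first_age + j for j in range(n_obs)]
    readings = []
    for age in ages:
        hinge = max(age - t, 0.0) if is_case else 0.0
        noise = rng.gauss(0.0, 0.05)                  # error term (hypothetical)
        readings.append(a0 + a * age + b * hinge + noise)
    return {"ages": ages, "readings": readings, "changepoint": t}

rng = random.Random(1)
subjects = [simulate_subject(rng, rng.uniform(28.4, 89.6), is_case=(i < 30))
            for i in range(60)]
n_cases = sum(s["changepoint"] is not None for s in subjects)
print(len(subjects), n_cases)
```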

We  now  analyze  this  simulated  data  set  using  the 
model  described  in  Section  2.  The  prior  distributions 
for  (“»),  13,  T,  (7•,■^and  are  listed  in  the 

Appendix and are also based on the longitudinal study described in Carter
et al. (1992). We take $y_{ij}$ to be the square root of the PSA reading
for subject $i$ at observation $j$, and $x_{ij}$ is the age of subject
$i$ at observation $j$.

We are primarily interested in sequentially estimating the marginal
posterior distributions for the $\{t_i\}$ and the $\{b_i\}$. Figure 1
shows the trajectories for the square root PSA readings for one of the 30
simulated cases as it evolves over time. This subject's initial reading
was at age 48.3 years and the changepoint occurred at age 65.5. Figure 2
shows the evolution of the posterior distribution of the changepoint
$t_i$ for this subject.

Figure 1: Trajectory of a typical simulated case. Dot at age 65.5 years
indicates the changepoint


4  Comparison  of  diagnostic  rules 

Three different diagnostic rules or criteria have been suggested for use
in screening for prostate cancer (Carter et al. 1992). The first is based
on a normal range, whereby any PSA reading above a threshold value
(typically 4 ng/ml) is considered a positive test result. The second and
third diagnostic rules are based on a rate of increase over a given time
period (e.g. 1.0 ng/ml/year over one year and 0.75 ng/ml/year over a two
year period). The formulation we have proposed leads naturally to a
fourth rule. At the time of the current test for a particular subject, we
compute the posterior probability that the changepoint has already
occurred. If the probability exceeds some specified cutoff value, then a
positive result is indicated. We would like to compare these four
suggested criteria: threshold, one year increase, average two year
increase and posterior probability.

Figure 2: Posterior distributions of the changepoint $t_i$ for the case
illustrated in Figure 1.
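
The four rules can be written as small predicates on a subject's history up to the current test. The sketch below is illustrative: the function names, the assumption of equally spaced readings, and the default cutoff of 0.5 for the posterior probability are ours, and the posterior probability is approximated by the fraction of sampled changepoints that fall at or before the current age.

```python
def threshold_rule(psa, cutoff=4.0):
    """Positive if the current PSA reading (ng/ml) exceeds the cutoff."""
    return psa[-1] > cutoff

def rate_rule(ages, psa, n_periods, cutoff):
    """Positive if the average increase (ng/ml/year) over the last n_periods
    readings exceeds the cutoff; negative when the history is too short."""
    if len(psa) <= n_periods:
        return False
    rate = (psa[-1] - psa[-1 - n_periods]) / (ages[-1] - ages[-1 - n_periods])
    return rate > cutoff

def posterior_rule(changepoint_draws, current_age, cutoff=0.5):
    """Positive if the estimated posterior probability that onset has
    already occurred exceeds the cutoff."""
    prob = sum(t <= current_age for t in changepoint_draws) / len(changepoint_draws)
    return prob > cutoff

# Made-up history for one subject.
ages = [60.0, 61.0, 62.0]
psa = [2.0, 2.4, 3.8]
draws = [58.0, 61.5, 63.0, 70.0]     # hypothetical posterior draws of t_i

results = (threshold_rule(psa),
           rate_rule(ages, psa, 1, 1.0),
           rate_rule(ages, psa, 2, 0.75),
           posterior_rule(draws, ages[-1], cutoff=0.4))
print(results)
```

Varying the cutoff of any one rule while holding the data fixed is what traces out its ROC curve.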

A standard method of comparing diagnostic rules
is to use receiver operating characteristic (ROC) curves
(Centor, 1991). ROC curves plot sensitivity versus
(1 - specificity) as the cutoff value for the given criterion
varies. Specificity is defined as the proportion of
non-diseased subjects that test negative, and sensitivity as
the proportion of diseased subjects that test positive.
These definitions were developed for a single test and
do not directly apply to a sequence of tests taken
periodically over time. This is because, with longitudinal
data, a single subject can be classified as a false positive
at one observation time and as a true positive at a later
observation time. Murtaugh et al. (1991) discussed ROC
curves for repeated markers. They classified each
subject as either true positive, false positive, true negative
or false negative using the series of observations, thus
effectively reducing the problem to the single test case.

We define a specificity rate, $\mathrm{spec}_i$, for subject $i$ as

$$\mathrm{spec}_i = \frac{\text{number of negative tests before changepoint}}{\text{number of tests before changepoint}}.$$

An estimate of population specificity is obtained by
averaging the subjects' rates. This definition weights each
subject in the sample equally and incorporates all the
data available.

We use a different approach to define sensitivity than
we do specificity for two reasons. The first is that
sensitivity is time dependent: a negative result ten years after
the changepoint cannot be compared with a negative
result within two years of the changepoint. Second, a true
positive result ends the series of observations. This leads
us to define a sensitivity indexed by time, K-period
sensitivity, where a period is the time between tests. Here,
for convenience, we assume the same period for all
subjects. A true positive is a subject with any positive test
result within K periods after the changepoint, and a false
negative is a subject with no positive test results within
K periods after the changepoint. K-period sensitivity is
the proportion of diseased subjects that test positive at
any time within K periods after onset.
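
The two definitions above can be sketched in code. This is only an illustration, not the authors' implementation: the function names, the data layout (parallel lists of test ages and boolean results per subject, plus a changepoint age) and the convention that a test falling exactly on the changepoint counts as "after" it are all assumptions made here.

```python
def specificity_rate(ages, results, changepoint):
    """Fraction of a subject's pre-changepoint tests that were negative."""
    before = [positive for age, positive in zip(ages, results) if age < changepoint]
    if not before:
        return None  # no tests before the changepoint: rate undefined
    return sum(1 for positive in before if not positive) / len(before)


def population_specificity(subjects):
    """Average the per-subject rates, so every subject has equal weight."""
    rates = [specificity_rate(*subject) for subject in subjects]
    rates = [rate for rate in rates if rate is not None]
    return sum(rates) / len(rates)


def k_period_sensitivity(subjects, k, period=1.0):
    """Proportion of diseased subjects with any positive test within
    k periods after the changepoint."""
    hits = 0
    for ages, results, changepoint in subjects:
        window_end = changepoint + k * period
        if any(positive and changepoint <= age <= window_end
               for age, positive in zip(ages, results)):
            hits += 1
    return hits / len(subjects)


# Hypothetical subject: yearly tests, changepoint at age 65.5.
ages = [60, 61, 62, 63, 64, 65, 66, 67]
results = [False, False, True, False, False, False, False, True]
subject = (ages, results, 65.5)

print(specificity_rate(*subject))            # one false positive in six tests
print(k_period_sensitivity([subject], k=1))  # no positive within one period
print(k_period_sensitivity([subject], k=2))  # positive at age 67 is within two
```

Sweeping the cutoff of a rule and recomputing both quantities at each value would trace out one ROC curve for a given K.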

We now use these definitions to compare the four
diagnostic rules. We construct ROC curves using our
simulated data for which the period is one year. Figure 3
shows ROC curves for four different values of K
(K = 1, 2, 3, 4). The curves show that the threshold
criterion is inferior, but that the others perform similarly
two or more years after the changepoint. In practice,
one may choose a rule by first identifying an acceptable
level for specificity and then selecting the rule with the
highest sensitivity. For our simulated data, the posterior
probability achieves the highest sensitivity for specificity
values greater than 95 percent.

Figure 3: ROC Curves For Simulated Data
Appendix: Prior Distributions

We list here the prior distributions used for the
application described in Section 3.
Abstract

One of the most difficult aspects of using the Gibbs
sampler in practice is knowing when to stop the algorithm.
In order to answer this we need to have some method
which will tell us when we have completed enough
iterations for the chain to have converged sufficiently. In
this paper I will look at some of the methods that have
been suggested in the literature. Most of these methods
require input from the user throughout the length
of the chain. This aspect of the diagnostics extends the
length of time that it takes for the algorithm to terminate
and is quite tedious for the user. Ideally one would
like to have an automatic algorithm which would test
for convergence and stop the Gibbs sampler when it is
sufficiently close to convergence. I will look at some of
the issues involved in finding such a diagnostic.

1  Introduction 

Markov Chain Monte Carlo (MCMC) methods have
recently become very popular tools for the analysis of
Bayesian posterior distributions of relatively high
dimension. The simplest of these algorithms is the Gibbs
Sampler which was introduced by Geman and Geman (1984)
in the context of image processing. It was then applied
to Bayesian problems by Gelfand and Smith (1990) and
Gelfand et al. (1990). With this method we set up a
Markov chain which has the posterior distribution of
interest as its stationary distribution. Then by running
the chain long enough we can sample from the posterior
and so make inferences about it by simulation.

The major problem with the application of the Gibbs
sampler is that it is very hard to know when the chain
is sufficiently close to the target distribution for us to
use it for inference. In some cases it is possible to
calculate bounds on the total variation distance between the
distribution of the chain after n iterates, $\pi^{(n)}$, and the
target distribution, $\pi$. Then we can find out a priori how
many iterations we need in order to make this distance
as small as we like. At present, however, such methods
have proved successful only in a very limited class of
mathematically tractable models. Also the bounds are
often quite loose and so can seriously over-estimate the
number of iterations required for convergence. For
examples of this method see Rosenthal (1991, 1993, 1994) and
Meyn and Tweedie (1993).

A more applied approach to the problem is to use the
output of the Gibbs sampler itself to assess when the
chain is close to its target distribution. It is only
necessary to run one implementation of the Gibbs sampler
for the theoretical convergence results of MCMC to hold;
however, it is often very difficult to differentiate
convergence from transient behaviour based on a single run.
Gelman and Rubin (1992) gave an example of the Ising
model in which they ran two chains from different
starting values. Individually, each chain appeared to have
converged well after 2000 iterations but the two chains
appeared to have converged to different distributions.
Since the stationary distribution is unique for the Gibbs
sampler, it is clear that the chains had not actually
converged. For this reason I believe that it is essential for
applied Gibbs sampling that a number of independent
chains, each with the required stationary distribution,
are used. Then, if only the final iterates from each chain
are used, we have an iid sample from the target
distribution. Even with multiple chains, the problem remains
that one is trying to assess convergence of a sequence of
d-dimensional distributions based on a finite sample.

2 Existing Methods for Assessing Convergence

The first method that was proposed to assess
convergence was the Thick Pen method (Gelfand et al. 1990).
In this method the user plots successive density
estimates of the univariate variable of interest and claims
that convergence has been achieved when the density
estimates differ by only a very small amount. Although
this method appeared to work for some models it was
clear that a better diagnostic was needed for the more
complex problems which it was hoped the Gibbs sampler
would be applied to.

More recent diagnostics include the Gibbs Stopper
(Ritter and Tanner 1992). This method was one of the
first methods to try to assess convergence in $R^d$. The
method is based on importance sampling. After n
iterates of the m chains we estimate the current
approximation, $\hat\pi^{(n)}$, to the target distribution $\pi$. We then find the
importance weights evaluated at the iterates of each
chain,

$$w_j = \frac{\pi(X_j^{(n)})}{\hat\pi^{(n)}(X_j^{(n)})}, \qquad j = 1, \ldots, m.$$

The chains are assessed to have converged when the
distribution of the weights is close to a spike at 1 (or some
constant if $\pi$ is not normalized). This method has a
number of disadvantages. First, it requires knowledge of
the normalizing constants in the full conditional
densities or at the very least a good estimate of them. In most
non-conjugate models such constants are not known and
the estimation process is very time consuming. Secondly,
the assessment of convergence is very subjective and
requires that the user monitor the weight distribution for
quite a while before being able to say that the weight
distribution is close to a spike. Finally, the code for this
method is highly dependent on the densities involved.
Hence it is necessary to write new code each time a new
model is used.

Another very popular convergence diagnostic was
proposed by Gelman and Rubin (1992). This method looks
at convergence of 1-dimensional variables of interest. In
this method each of the m independent chains is
allowed to run 2n iterations. The first n iterates are then
discarded and we just look at the variable of interest in
the second n iterates of each chain. An ANOVA type
analysis is then applied to this m x n matrix of
observations. The algorithm then calculates the within chain
and between chain variances as well as estimates of the
overall mean and variance. This method assumes that
the variable of interest is approximately normally
distributed so a conservative Student's t distribution is used
to give the current estimate of this distribution. Finally
we find the potential scale reduction if sampling was
allowed to continue to infinity. When this value is close
to 1 convergence is said to have been achieved. This
method has the advantage that we get a numerical value
which is easier to assess. Also generic code is available
which allows the method to be used on the output of
any Gibbs sampler. One drawback is the assumption of
approximate normality of the variable of interest, since the
cases that are of more applied interest are those where
the assumption of normality is not justified.
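
To make the ANOVA-type calculation concrete, here is a minimal sketch of a potential scale reduction computed from the m x n matrix of retained iterates. It is a simplified version written for illustration: it uses only the within-chain and between-chain variances and omits the degrees-of-freedom correction that comes from the Student's t approximation, so it should not be read as the exact Gelman and Rubin (1992) statistic. The function name is an assumption.

```python
import math
import statistics


def potential_scale_reduction(chains):
    """Simplified potential scale reduction for m chains of equal length n.

    chains: list of m lists, each holding the last n iterates of the
    1-dimensional variable of interest from one chain.
    """
    n = len(chains[0])
    chain_means = [statistics.fmean(chain) for chain in chains]
    # W: average of the within-chain sample variances.
    within = statistics.fmean([statistics.variance(chain) for chain in chains])
    # B: n times the sample variance of the chain means.
    between = n * statistics.variance(chain_means)
    # Pooled estimate of the target variance.
    pooled = (n - 1) / n * within + between / n
    return math.sqrt(pooled / within)


# Two chains exploring the same region give a value near (here below) 1 ...
print(potential_scale_reduction([[0, 1, 2, 3], [0, 1, 2, 3]]))
# ... while two chains stuck in different regions give a value well above 1.
print(potential_scale_reduction([[0, 1, 2, 3], [10, 11, 12, 13]]))
```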

The most mathematically sound convergence
diagnostic was proposed by Roberts (1992). This method
requires that we run a reversible Gibbs sampler which in
one iteration cycles from the first component of X to the
last and then from the last component back to the first
again. Based on this sampler Roberts defines a
distributional norm such that $\|\pi^{(n)} - \pi\| \downarrow 0$. He then
constructs an unbiased estimator of $1 + \|\pi^{(n)} - \pi\|$. The
sequence of true values being estimated is a monotone
sequence with 1 as its limit, hence by looking at the
estimates after each iteration we should be able to see
if convergence is indicated. Note that if the
normalizing constant for $\pi$ is not known then the convergence is
to an unknown constant. The major problem with this
method is that the estimator can have very high
variance which often masks the monotone convergence of the
quantity it is estimating. We must also know the
normalizing constants for the full conditionals or find good
approximations to them in order to calculate the
estimate. Once again this method requires that new code
be written for each new problem.

All of these methods require some sort of subjective
assessment by the user as to when the chain has reached
convergence. In practice this means that the user must
monitor these diagnostics while the Gibbs sampler is
running. Due to the often slow convergence of the Gibbs
sampler this can require a lot of interaction between the
user and the algorithm and so take a lot of user time and
also slow down the running time for the Gibbs sampler.
In the next section I will look at whether it is possible
to reduce this user interaction in the convergence
assessment process.


3  Automating  the  Termination 
Procedure 

Ideally one would like a totally automatic procedure
which would run the Gibbs sampler without any user
interaction until the chains had converged to the target
distribution and then return a sample from this
distribution. This requires a convergence diagnostic which can
be monitored for signs of convergence by a computer
algorithm without any user involvement. Clearly such
a convergence diagnostic would need to be totally
numerical and there would need to be some test of when
the value of the convergence diagnostic indicates
convergence. Unfortunately, such a totally automatic
algorithm does not seem to be possible. It is very easy for the
Gibbs sampler to become stuck in areas of the sample
space which have a local mode for long periods of time
and if all the chains should happen to become stuck at
the same mode then any convergence diagnostic would
indicate convergence even though it had not occurred
yet. It should be possible, however, to reduce user
interaction by having an algorithm which could detect when
convergence has not occurred and only ask for user
input when the diagnostic indicates that convergence may
have been achieved.

Suppose that the user runs m independent chains from
an initial distribution. The initial distribution should be
over-dispersed relative to the target distribution so that
important areas of the target distribution are not missed.
Let Z be a 1-dimensional variable of interest which is a
function of X. Then the goal is to have the algorithm
continue sampling when the current distribution of the
variable of interest is not the true target distribution of
Z, and to alert the user when it may be at the correct
distribution.

If we assume that we do not start the chains from
the stationary distribution but from some other
distribution, then convergence cannot have been achieved while
the chains are still sampling from the initial
distribution. Therefore the first part of the proposed method is
to continue sampling until the chains appear to have left
the initial distribution. In order to test this we need only
look at the initial sample and the current sample of final
values from each chain. We then need a way of
comparing the distributions which produced these two samples.
Since the distribution of $Z^{(n)}$ is unknown we need a
non-parametric test for the equality of two distributions. The
usual tests do not appear to have enough power to detect
the small differences that are possible and so I have
constructed another test. This test is more powerful than
the standard tests as long as the effective supports of the
two distributions are equal.

The test, for the two samples $Z_1^{(0)}, \ldots, Z_m^{(0)}$ and
$Z_1^{(n)}, \ldots, Z_m^{(n)}$, is as follows.

1. Fit a line to the Q-Q plot of the two samples. Let
$(\hat\alpha, \hat\beta)$ be the estimates of the intercept and slope
respectively.

2. Under the hypothesis that the two samples are from
the same distribution, the true values of $(\alpha, \beta)$ are
(0, 1). Hence a measure of the difference between
the distributions would be the distance between the
estimates and (0, 1). This squared distance is given by

$$|(\hat\alpha, \hat\beta)|^2 = \hat\alpha^2 + (\max(\hat\beta, 1/\hat\beta) - 1)^2.$$


3. Now take b pairs of bootstrap samples from the
sample $Z_1^{(n)}, \ldots, Z_m^{(n)}$ and repeat the first two steps
for each pair. Since each pair is a pair of samples
from the distribution of $Z^{(n)}$, this will estimate the
variability of the distance measure if the two
samples come from the same distribution.

4.  If  less  than  5%  of  the  bootstrap  distances  are  larger 
than  the  observed  distance  then  we  can  conclude 
that  the  chains  have  left  the  initial  distribution.  If 
more  than  5%  of  the  bootstrap  distances  are  greater 
than  our  observed  distance  continue  sampling. 

It is inefficient for the algorithm to test departure from
the initial distribution after every iteration so it is
recommended that it do the test every gap iterates where
gap is a user supplied integer. The ideal value for gap
will depend on the initial distribution and the true
distribution of Z.
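
The four steps above can be sketched as follows. This is an illustrative reading, not the author's code: the function names are hypothetical, the line is fitted by ordinary least squares with the sorted initial sample on the horizontal axis, and the bootstrap pairs are resampled from the current sample, all of which are assumptions made here because the text does not pin them down.

```python
import random


def qq_line(x, y):
    """Least-squares intercept and slope of sorted y against sorted x."""
    xs, ys = sorted(x), sorted(y)
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((a - mean_x) ** 2 for a in xs)
    if sxx == 0:
        return None  # degenerate sample: no line can be fitted
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def squared_distance(x, y):
    """alpha^2 + (max(beta, 1/beta) - 1)^2 for the fitted Q-Q line."""
    fit = qq_line(x, y)
    if fit is None or fit[1] <= 0:
        return float("inf")  # treat a degenerate fit as maximally distant
    alpha, beta = fit
    return alpha ** 2 + (max(beta, 1.0 / beta) - 1.0) ** 2


def departure_p_value(initial, current, n_boot=1000, rng=None):
    """Share of bootstrap distances that exceed the observed distance."""
    rng = rng or random.Random()
    observed = squared_distance(initial, current)
    m = len(current)
    exceed = 0
    for _ in range(n_boot):
        first = rng.choices(current, k=m)   # resample with replacement
        second = rng.choices(current, k=m)
        if squared_distance(first, second) > observed:
            exceed += 1
    return exceed / n_boot


rng = random.Random(1)
initial = [rng.uniform(-4.0, 8.0) for _ in range(25)]  # over-dispersed start
current = [rng.gauss(0.0, 1.0) for _ in range(25)]     # values after n iterates
p = departure_p_value(initial, current, n_boot=1000, rng=rng)
print("left initial distribution" if p < 0.05 else "continue sampling")
```

In a sampler this check would run only every gap iterations, as recommended above.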

Once the algorithm has found n such that the
distribution of $Z^{(n)}$ is significantly different from the
distribution of $Z^{(0)}$ it can start testing for convergence. In
order for convergence to have been achieved, it is
necessary that all m chains be sampling from the true
distribution of Z. Also under convergence the across chain
distribution should be the true distribution. The
algorithm that I propose will test if all chains are sampling
from the same distribution and if this is also the across
chain distribution. It will not, however, test if this is
the true distribution. Such a test would require that
complex code be written for every new model and every
new variable of interest. If all chains are sampling from
the same distribution then it is probable that it is the
true distribution but the assessment of this is left to the
user's knowledge of the actual model.

In order to look for possible convergence I propose
that the chains be allowed to run a further n iterations
to give a total of 2n iterates. Then we can compare
the m within chain distributions to the across chain
distribution at time 2n. We will use the same distance
measure as before, so we must take a sample of size m
from each chain. These m observations should be taken
from the second half of the chain. I have used equally
spaced iterates between n + 1 and 2n. Compare each
of these m samples with the sample $Z_1^{(2n)}, \ldots, Z_m^{(2n)}$.
Then define the maximum squared distance to be the
maximum of the m squared distances. For the
bootstrap part take the pairs of bootstrap samples from the
across chain sample at time 2n. The bootstrap
distribution for the squared distance measure for samples from
$Z^{(2n)}$ is all that is needed to test convergence since the
m chains are independent and under the null hypothesis
they are samples from the same distribution as $Z^{(2n)}$.
Hence, under the null hypothesis the squared distances
are iid, and so the quantiles of their maximum can be
derived from the quantiles of the distribution of squared
distances for pairs of samples from $Z^{(2n)}$. Therefore the
null hypothesis of convergence should be rejected if the
observed p-value is less than $1 - \sqrt[m]{0.95}$.
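
The rejection cutoff follows from a one-line calculation. The sketch below assumes that, under the null hypothesis, the m single-chain bootstrap p-values $p_1, \ldots, p_m$ are independent and uniform on (0, 1), and that the overall test is run at the 5% level; the symbols $p_i$ and $c$ are introduced here for illustration only.

```latex
\begin{aligned}
P\Bigl(\min_{1 \le i \le m} p_i < c\Bigr) &= 1 - (1 - c)^m = 0.05 \\
\Longrightarrow \quad c &= 1 - 0.95^{1/m}.
\end{aligned}
```

The p-value of the maximum squared distance is the smallest of the m individual p-values, which is why the minimum appears. For m = 25 the cutoff is approximately 0.00205.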

If the null hypothesis of convergence is rejected at
iteration 2n, then the algorithm should automatically
continue with sampling. For small n it is possible that the
null hypothesis was rejected because the chains did not
have sufficient time to move over the whole sample space.
Therefore, the algorithm should now double the length
of the chain again and in the next test use the final 2n
iterates. Each time the chain length doubles the algorithm
needs to store twice as many values so for the purposes of
efficiency and physical storage the distance between tests
cannot be doubled indefinitely. I propose that the chain
length be doubled after each test until the time between
tests is large enough that under convergence the chains
should move over the whole sample space in that number
of iterates. Clearly this will depend on the model that
is being used and also will be determined by the storage
capacity of the machine. For this reason I feel that this
maximum number of iterates between tests should be a
user supplied value.
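
The doubling rule with a user supplied cap can be written as a short generator. This is a sketch under assumptions: the names `n_departure` (the iteration at which departure from the initial distribution was detected) and `max_gap` (the user supplied maximum number of iterates between tests) are hypothetical, and the first convergence test is taken to occur at twice the departure iteration.

```python
from itertools import islice


def convergence_test_iterations(n_departure, max_gap):
    """Yield the iterations at which convergence is tested.

    The chain length doubles after each test until the gap between
    tests would exceed max_gap; from then on the gap stays at max_gap.
    """
    current = n_departure
    while True:
        gap = min(current, max_gap)
        current += gap
        yield current


# Departure detected at iteration 100, effectively no cap on the gap.
print(list(islice(convergence_test_iterations(100, 10**6), 4)))
# Same start, but never more than 500 iterates between tests.
print(list(islice(convergence_test_iterations(100, 500), 5)))
```

With departure at iteration 100 and no binding cap, the tests fall at iterations 200, 400, 800 and 1600.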

Once  the  hypothesis  of  convergence  is  not  rejected, 
there  is  no  more  that  the  algorithm  can  say  at  that  point. 
The  user  should  then  be  notified  that  convergence  may 
have  been  achieved.  At  this  point  it  is  up  to  the  user  to 
see  whether  convergence  has  actually  been  achieved  or 
if  the  chains  are  simply  stuck  in  a  portion  of  the  sample 
space.  The  easiest  way  of  doing  this  is  to  plot  the  sample 
paths  of  each  chain  over  the  iterates  on  which  the  final 
convergence  test  was  based.  All  of  these  plots  should  be 
similar  and  they  should  all  cover  the  important  areas  of 
the  sample  space.  At  this  point  one  could  also  try  the 
Gelman  and  Rubin  test  on  the  same  matrix  of  obser¬ 
vations  as  used  by  the  final  test.  If  both  of  these  user 
checks  seem  to  confirm  the  result  that  convergence  has 
been  achieved  then  it  is  probably  safe  to  use  the  chains 
for  inference. 

For  this  method,  and  most  convergence  diagnostics,  to 
succeed  it  is  very  important  that  the  initial  distribution 
be  chosen  to  cover  the  complete  effective  sample  space. 

If all of the chains are started near a local mode then it
is likely that the algorithm will assess convergence much
sooner than is correct, since each of the chains will tend
to stay near the local mode and so all the chains will
have the same distribution as the across chain sample
but it is not the correct target distribution. This situation
can  also  arise  due  to  chance  if  the  starting  values  are 
selected  at  random.  For  this  reason  it  is  vital  that  the 


user  have  some  idea  of  what  the  sample  space  is  and  to 
make  sure  that  this  whole  area  is  covered  by  all  of  the 
chains  when  the  algorithm  does  not  reject  the  hypothe¬ 
sis  of  convergence.  One  way  of  avoiding  a  bad  random 
sample  is  to  select  the  starting  points  systematically  to 
cover  an  area  which  is  larger  than  the  sample  space  of 
the  target  distribution.  This  is  the  method  that  many 
users  of  Gibbs  sampling  actually  use  in  practice  to  find 
starting  values  for  the  chains.  The  proposed  algorithm 
will  still  work  for  starting  values  chosen  in  this  way,  the 
only  difference  being  that  for  such  starting  values  de¬ 
viation  from  the  initial  distribution  would  be  detected 
sooner  and  so  the  second  phase  of  the  algorithm  would 
start  earlier  than  for  starting  values  chosen  from  a  dis¬ 
tribution  which  approximates  the  target  better.  This 
difference  does  not  appear  to  affect  the  number  of  itera¬ 
tions  before  the  algorithm  detects  possible  convergence. 
Since  the  algorithm  is  totally  free  of  any  distributional 
assumptions,  code  can  be  written  to  do  the  testing  which 
can  then  be  applied  to  any  situation.  All  that  is  required 
is  the  ability  to  incorporate  the  code  for  the  convergence 
testing  into  the  code  for  the  Gibbs  sampling,  and  there 
will  need  to  be  some  global  variables  which  keep  track  of 
whether  the  algorithm  is  testing  for  deviation  from  the 
initial  distribution  or  testing  for  convergence,  and  when 
the  next  test  is  due.  The  extra  time  that  is  used  by 
the  algorithm  to  complete  the  required  tests  is  not  pro¬ 
hibitive  to  its  use  in  practice  as  long  as  gap  is  chosen  well 
and  the  number  of  bootstrap  samples  is  not  excessive.  In 
most  cases  I  have  found  that  about  1000  bootstrap  pairs 
is  sufficient. 

4  Examples 

Here I will present 2 examples on which I used this
algorithm. Both of them are examples where previous
convergence diagnostics have had trouble. In both of these
cases I ran 25 independent chains and for each test I used
1000 pairs of bootstrap samples. For the second stage of
the convergence test the algorithm requires an observed
p-value of greater than $1 - \sqrt[25]{0.95} = 0.0021$ in order to
not reject the hypothesis of convergence.

Example  1 

For the first example I used an equal mixture of bivariate
normals. The distributions were centered at $\mu_1 = (0, 0)$
and $\mu_2 = (4, 4)$. In both cases the covariance matrix
was the identity matrix. The variable of interest was
the first component of X. For starting values I took
25 equally spaced points along the line x = y between
(-4,-4) and (8,8). For the first stage of the algorithm
I tested deviation from the initial distribution every 50
iterations.

Table 1: Convergence test for example 1

Iteration   Max Squared Distance   Bootstrap p-value
200         8.447                  0
400         5.916                  0.002
800         6.307                  0
1600        2.872                  0.009

It took only 100 iterations for the algorithm to
detect deviation from the initial distribution and then took
a further 1500 iterations until the first time that
convergence was indicated. Plots of the final 800 sampled
values of the variable of interest showed that all chains
moved between the two modes with approximately the
correct frequencies at each mode. The convergence tests
are summarized in Table 1. In this example the algorithm
quickly detected deviation from the initial distribution,
but this simply meant that the points were no longer
evenly spread. All that had happened by 100 iterations
was that the chains had moved towards the closest mode
and so there were two groups of chains. It then took a
relatively long time for all the chains to move to
convergence.
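
For readers who want to reproduce the setting of Example 1, here is a minimal sketch of a Gibbs sampler for the equal mixture of two bivariate normals with identity covariance. It is an illustration written for this example, not the author's code; the function names and the random seed are assumptions. Given the other coordinate, each coordinate is drawn from a two-component normal mixture whose weights are proportional to the normal densities of the other coordinate under each component.

```python
import math
import random


def draw_coordinate(other, rng, centre=4.0):
    """Draw one coordinate given the other for the mixture
    0.5 * N((0, 0), I) + 0.5 * N((centre, centre), I)."""
    # log of phi(other - centre) / phi(other): log odds of the second component
    log_odds = centre * other - 0.5 * centre ** 2
    p_second = 1.0 / (1.0 + math.exp(-log_odds))
    mean = centre if rng.random() < p_second else 0.0
    return rng.gauss(mean, 1.0)


def run_chain(start, n_iter, rng):
    """Run one chain and return the path of the first component."""
    x, y = start
    path = []
    for _ in range(n_iter):
        x = draw_coordinate(y, rng)
        y = draw_coordinate(x, rng)
        path.append(x)
    return path


# 25 equally spaced starting points on the line x = y from (-4, -4) to (8, 8).
starts = [(-4.0 + 0.5 * k, -4.0 + 0.5 * k) for k in range(25)]
rng = random.Random(1)
paths = [run_chain(start, 100, rng) for start in starts]

# After 100 iterations each chain tends to sit near its closest mode.
near_second_mode = sum(1 for path in paths if path[-1] > 2.0)
print(near_second_mode, "of 25 chains currently near the mode at 4")
```

Because a move between modes requires the conditioning coordinate to wander to around 2, such moves are rare, which is consistent with the slow progress to convergence reported above.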


Example  2 

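For the second example I used the "Witch's Hat"
distribution given by

$$\pi(x) \propto (1 - \delta)\,(2\pi\sigma^2)^{-d/2} \exp\left\{ -\sum_{i=1}^{d} \frac{(x_i - \mu_i)^2}{2\sigma^2} \right\} + \delta, \qquad x \in [0, 1]^d.$$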


For this example I took d = 8, $\mu_i = 0.7$, i = 1, ..., d,
$\sigma = 0.03$ and $\delta = 10^{\ldots}$. This distribution has a very
sharp peak at the point $\mu$ and a flat brim on the rest of
the set $[0, 1]^d$. Since the effective sample space is $[0, 1]^d$, I
used the uniform distribution over this set as my initial
distribution. The variable of interest that I looked at
was the first component of X again.

In  this  case  it  took  600  iterations  before  the  algorithm 
detected  deviation  from  the  initial  uniform  distribution. 
The  results  of  the  second  stage  of  the  algorithm  are  in 
table  2.  Convergence  is  suggested  after  only  two  tests  in 
this  case  and  sample  path  plots  of  the  first  component 
from  iteration  1201  to  2400  showed  that  all  25  chains 
were  sampling  from  the  spike  and  so  convergence  could 
be  assumed. 


Table 2: Convergence test for example 2

Iteration   Max Squared Distance   Bootstrap p-value
1200        39.9537                0
2400        0.3588                 0.052
ABSTRACT

In this paper, the likelihood method
for confidence interval estimation of
traffic intensity for the M/M/1 queueing
system is investigated for priority
queues. Approximate confidence
interval formulae for the mean queue
lengths are derived for the M/M/1
priority queueing system. Numerical
examples illustrate the above techniques.

1.  Introduction 

The idea of applying statistical analysis to
queueing data dates back to Clarke's
(1957) paper where he estimated the
parameters for a simple M/M/1 queueing
system using the principle of maximum
likelihood. Later, Lilliefors (1966)
examined the problem of finding the
confidence interval for the traffic
intensity $\rho$ ($\rho = \lambda/\mu$ is the ratio of mean arrival
rate to mean service rate). He used the
estimates of traffic intensity to obtain the
confidence intervals for the expected
number of units in the system. A direct
approach based on the number of arrivals
during the nth service period for the
M/E_k/1 queue, discussed by Bhat and Rao
(1987), was used by Jain (1991) to obtain
confidence intervals for the traffic
intensity $\rho$. This paper addresses the problem
of a priority queue in which the service
discipline is first come, first served; however,
the service could be interrupted if a
customer with priority arrived. Obviously,
priority queues are generally more
difficult to model [see
Gross and Harris (1985)]. However,
priority considerations are often required
in real-life queueing situations
(e.g. in the emergency department of a
hospital or in a post office).

In this paper, we consider the
situation where the highest priority customer
is allowed to enter service
immediately even if another customer with lower
priority is already in service
when the higher priority customer arrives
in the system. Such a system is called a
preemptive priority queueing system [see
Gross and Harris (1985)]. The object of
this paper is to estimate the parameters of
such a system and then to obtain
confidence interval formulae for the
expected queue length for the
priority and non-priority customers. In
Section 2, preliminary results concerning
the estimation of parameters for M/M/1
queueing systems are given. Section 3
deals with the parameter estimation
procedures for the priority and non-priority
customers. The approximate confidence
interval formulae for the mean queue
length are obtained for the M/M/1
priority queues. Finally, numerical
procedures are illustrated with an example.

2.  Preliminary  Results 
Clarke (1957) considered the
M/M/1 queue, in which customers' arrivals
form a Poisson process with parameter $\lambda$,
and service times of the customers are
independent and identically exponentially
distributed random variables with
mean $1/\mu$. The ratio $\rho = \lambda/\mu$ is called the
traffic intensity. The maximum likelihood
estimates of the parameters $\lambda$ and $\mu$
are given by

$$\hat\lambda = \frac{N_a}{t} \qquad (2.1)$$

and

$$\hat\mu = \frac{N_s}{t_b}, \qquad (2.2)$$

where the system is observed for a
fixed interval of time of duration $t$. During
this time period, there are $N_a$ arrivals, $N_s$
service completions and the service facility
is busy for $t_b$ time units.

Lilliefors (1966) considered the
problem of finding the confidence intervals
for the actual M/M/1 traffic intensity
given by

$$\rho = \frac{\lambda}{\mu}. \qquad (2.3)$$

Thus, the traffic intensity is estimated by

$$\hat\rho = \frac{N_a/t}{N_s/t_b}. \qquad (2.4)$$

Consider the following ratio

$$\frac{\hat\rho}{\rho} = \frac{(N_a/t)/(N_s/t_b)}{\lambda/\mu} = \frac{2\mu t_b/2N_s}{2\lambda t/2N_a}. \qquad (2.5)$$

For large samples, Cox (1965) stated
that $2\lambda t$ can be treated as a Chi-squared
variate with $2N_a$ degrees of freedom and
$2\mu t_b$ as a Chi-squared variate with $2N_s$
degrees of freedom. Thus, $\hat\rho/\rho$ has an
F-distribution with degrees of freedom $2N_s$
and $2N_a$. An appropriate probability
statement at significance level $\alpha$ can be
written as follows:

$$P\left[ F_{1-\alpha/2}(2N_s, 2N_a) \le \frac{\hat\rho}{\rho} \le F_{\alpha/2}(2N_s, 2N_a) \right] = 1 - \alpha. \qquad (2.6)$$

Therefore,  the  upper  and 
confidence  limits  for  p  are  given  by 

lower 

a  P 

P“  Fi.^(2Ng,2Na)  ’ 

(2.7) 

A, 

F^(2Ng,2Na)  ■ 

(2.8) 

If  /  (p)  is  a  monotonically  increas¬ 
ing  function  of  p,  then  the  100(1  -  a)% 
confidence  intervals  for/ (p)  is 

/(P„)^/(P)^/(Pl)- 

(2.9) 
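
The estimate (2.4) and the limits (2.7) and (2.8) can be sketched in a few lines of Python. This is only an illustration: the function name, the argument names and the sample counts are assumptions, and the two F-quantiles are passed in by the caller (for example from a table) rather than computed here.

```python
def traffic_intensity_ci(n_a, n_s, t, t_b, f_lower, f_upper):
    """Return (rho_hat, rho_l, rho_u) following (2.4), (2.7) and (2.8).

    f_lower stands for F_{1-alpha/2}(2 n_s, 2 n_a) and f_upper for
    F_{alpha/2}(2 n_s, 2 n_a); both are supplied by the caller.
    """
    if n_a <= 0 or n_s <= 0 or t <= 0 or t_b <= 0:
        raise ValueError("counts and durations must be positive")
    if not 0 < f_lower <= f_upper:
        raise ValueError("expected 0 < f_lower <= f_upper")
    rho_hat = (n_a / t) / (n_s / t_b)   # (2.4)
    rho_u = rho_hat / f_lower           # (2.7)
    rho_l = rho_hat / f_upper           # (2.8)
    return rho_hat, rho_l, rho_u


# Hypothetical observation: 20 arrivals in 100 time units and 20 service
# completions in 50 busy time units. The value 1.69 is an assumed table
# value for the upper 5% point of F(40, 40); with equal degrees of freedom
# the lower point is its reciprocal.
f_upper = 1.69
rho_hat, rho_l, rho_u = traffic_intensity_ci(20, 20, 100.0, 50.0,
                                             1.0 / f_upper, f_upper)
```

If `rho_u` comes out at one or above, the interval cannot be mapped through queue-length formulae that require $\rho < 1$.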

3. Estimation Procedures for Priority Queues

Taylor and Karlin (1984) considered a single-server queueing system with two
types of customers, called priority and nonpriority. The two types arrive
independently according to Poisson processes with parameters $\alpha$ and
$\beta$, respectively. The customers' service times are independent and
exponentially distributed with parameters $\gamma$ and $\delta$, respectively.
The service discipline is FCFS, and the service of priority customers is
never interrupted. A priority customer is allowed to enter service immediately
even if a nonpriority customer is already in service. The interrupted
customer's service is resumed when no priority customer is present in the
system.

Notations

System arrival rate: $\lambda = \alpha + \beta$
Proportion of priority customers: $p = \alpha/\lambda$
Proportion of nonpriority customers: $q = \beta/\lambda$

The system mean service time is the appropriately weighted mean of the
priority and nonpriority mean service times, given by

$$\frac{1}{\mu} = \frac{p}{\gamma} + \frac{q}{\delta} = \frac{1}{\lambda}\left(\frac{\alpha}{\gamma} + \frac{\beta}{\delta}\right), \qquad (3.1)$$

where $\mu$ is the system service rate.

The traffic intensities for the system, the priority customers and the
nonpriority customers are given by

$$\rho = \frac{\lambda}{\mu}, \qquad (3.2)$$

$$\sigma = \frac{\alpha}{\gamma}, \qquad (3.3)$$

and

$$\tau = \frac{\beta}{\delta}. \qquad (3.4)$$

It is clear from (3.1) that

$$\rho = \sigma + \tau. \qquad (3.5)$$

Taylor and Karlin (1984) obtained the mean queue lengths for the priority
and nonpriority customers in the steady state as follows:

$$L_p = \frac{\sigma}{1-\sigma} \qquad (3.6)$$

and

$$L_n = \frac{\tau}{1-\sigma-\tau}\left[1 + \frac{\delta}{\gamma}\cdot\frac{\sigma}{1-\sigma}\right]. \qquad (3.7)$$

$L_n$ is finite if the system traffic intensity $\rho$ ($\rho = \sigma + \tau$)
is less than 1.

For a simple M/M/1 queueing system with traffic intensity $\rho$, the mean
queue length $L$ is given by

$$L = \frac{\rho}{1-\rho}. \qquad (3.8)$$

Suppose that a proportion $p$ of the customers have priority and that priority
is independent of service time. Let $\delta = \gamma$, which implies
$\sigma = p\rho$ and $\tau = q\rho$. Then the expected queue lengths for the
priority and nonpriority customers are given by

$$L_p = \frac{p\rho}{1-p\rho} \qquad (3.9)$$

and

$$L_n = \frac{q\rho}{1-\rho}\left[1 + \frac{p\rho}{1-p\rho}\right]. \qquad (3.10)$$

Therefore, the expected difference in queue length between nonpriority and
priority customers is given by

$$D = L_n - L_p = \frac{\rho}{1-p\rho}\left[\frac{q - p + p\rho}{1-\rho}\right]. \qquad (3.11)$$

It can be shown that $D$ is a monotonically increasing function of $\rho$
($0 < \rho < 1$) as follows:

$$\frac{1}{D}\frac{dD}{d\rho} = \frac{1}{\rho} + \frac{p}{1-p\rho} + \frac{p}{q - p + p\rho} + \frac{1}{1-\rho}.$$

The above is greater than zero if $0 < \rho < 1$ and $0 < p \le 0.5$. Hence,
confidence limits for $D$ can be obtained by substituting the lower and upper
limits of $\rho$ from formulae (2.7) and (2.8).


4. Numerical Example


The estimate of the mean queue length difference between nonpriority and
priority customers, using equation (3.11), is given by

$$\hat D = \frac{\hat\rho}{1-p\hat\rho}\left[\frac{q - p + p\hat\rho}{1-\hat\rho}\right], \qquad (4.1)$$

where $\hat\rho$ is the estimated traffic intensity for the system.

The upper and lower confidence limits for $D$ are given by

$$D_u = \frac{\rho_u}{1-p\rho_u}\left[\frac{q - p + p\rho_u}{1-\rho_u}\right] \qquad (4.2)$$

and

$$D_l = \frac{\rho_l}{1-p\rho_l}\left[\frac{q - p + p\rho_l}{1-\rho_l}\right], \qquad (4.3)$$

where $\rho_u$ and $\rho_l$ are computed using formulae (2.7) and (2.8),
respectively. Tables 4.1, 4.2 and 4.3 give the widths of 90% confidence
intervals for various numbers of arrivals and service completions at
$p = 0.2$. Similarly, Tables 4.4, 4.5 and 4.6 present the corresponding
results computed at $p = 0.4$.
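
As a rough check of (3.11) and (4.1) to (4.3), the following Python sketch evaluates $D$ and its confidence limits. The function names are illustrative, and the F-quantile 1.69 used for the limits is an assumed table value for 40 and 40 degrees of freedom, not a figure stated in the text.

```python
def queue_length_difference(rho, p):
    """Expected difference D = L_n - L_p of equation (3.11), with q = 1 - p."""
    if not 0.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between 0 and 1")
    q = 1.0 - p
    return rho / (1.0 - p * rho) * (q - p + p * rho) / (1.0 - rho)


def difference_limits(rho_l, rho_u, p):
    """Limits (4.3) and (4.2); valid because D increases in rho for p <= 0.5."""
    if not 0.0 < p <= 0.5:
        raise ValueError("monotonicity is only established for 0 < p <= 0.5")
    return queue_length_difference(rho_l, p), queue_length_difference(rho_u, p)


# Point values for the six (rho, p) settings used in the tables.
d_values = {(rho, p): queue_length_difference(rho, p)
            for p in (0.2, 0.4) for rho in (0.2, 0.4, 0.5)}

# Limits for rho_hat = 0.5, p = 0.2 and 20 arrivals and completions,
# taking 1.69 as the (assumed) upper 5% point of F(40, 40).
d_l, d_u = difference_limits(0.5 / 1.69, 0.5 * 1.69, 0.2)
```

A value of `rho_u` at or above one makes `queue_length_difference` raise an error, which mirrors the restriction $\rho < 1$ under which $L_n$ is finite.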


Table 4.1. 90% confidence intervals for various numbers of arrivals and
service completions with traffic intensity $\rho = 0.2$ and $D = 0.1667$
at $p = 0.2$.

n_a = n_s    D_u       D_l       Width of CI
20           0.3656    0.0854    0.2802
30           0.3105    0.0969    0.2136
40           0.2853    0.1032    0.1821
50           0.2673    0.1088    0.1585
60           0.2557    0.1127    0.1430

Table 4.2. 90% confidence intervals for various numbers of arrivals and
service completions with traffic intensity $\rho = 0.4$ and $D = 0.4927$
at $p = 0.2$.

n_a = n_s    D_u       D_l       Width of CI
20           1.7737    0.2095    1.5642
30           1.2983    0.2430    1.0553
40           1.1185    0.2642    0.8543
50           1.0020    0.2819    0.7201
60           0.9318    0.2945    0.6373

Table 4.3. 90% confidence intervals for various numbers of arrivals and
service completions with traffic intensity $\rho = 0.5$ and $D = 0.7778$
at $p = 0.2$.

n_a = n_s    D_u       D_l       Width of CI
20           5.0449    0.2944    4.7505
30           2.8940    0.3456    2.5484
40           2.2792    0.3782    1.9191
50           1.9558    0.4067    1.5491
60           1.7648    0.4283    1.3365

Table 4.4. 90% confidence intervals for various numbers of arrivals and
service completions with traffic intensity $\rho = 0.2$ and $D = 0.0761$
at $p = 0.4$.

n_a = n_s    D_u       D_l       Width of CI
20           1.1759    0.0347    1.1412
30           0.1619    0.0402    0.1217
40           0.1460    0.0432    0.1028
50           0.1348    0.0459    0.0889
60           0.1277    0.0479    0.0798

Table 4.5. 90% confidence intervals for various numbers of arrivals and
service completions with traffic intensity $\rho = 0.4$ and $D = 0.2857$
at $p = 0.4$.

n_a = n_s    D_u       D_l       Width of CI
20           1.3452    0.1009    1.2443
30           0.9290    0.1200    0.8090
40           0.7768    0.1329    0.6439
50           0.6802    0.1439    0.5363
60           0.6229    0.1521    0.4708

Table 4.6. 90% confidence intervals for various numbers of arrivals and
service completions with traffic intensity $\rho = 0.5$ and $D = 0.5$
at $p = 0.4$.

n_a = n_s    D_u       D_l       Width of CI
20           4.4305    0.1518    4.2787
30           2.3735    0.1847    2.1888
40           1.8195    0.2063    1.1613
50           1.5086    0.2256    1.2830
60           1.3372    0.2405    1.0967

Concluding Remarks

The widths of 90% confidence intervals for the expected difference in queue
length between nonpriority and priority customers in an M/M/1 queueing
system have been computed. Tables 4.1 to 4.6 indicate that the width of the
confidence interval decreases as the numbers of arrivals and service
completions increase. Tables 4.3 and 4.6 also show that the confidence
interval widens rapidly as the traffic intensity increases. The upper limit
$\rho_u$ computed from formula (2.7) may be greater than one. Under such
circumstances the statistical techniques have limitations, and a cautious
approach is required when estimating the parameter.

This work is a study of the representation and analysis of ergonomists'
knowledge. The context of this study is the evaluation of a control room for
nuclear power plants.

It includes the creation and adaptation of the theoretical framework chosen
to solve the problem (Symbolic Data Analysis) and the realization of a
prototype coupling numerical algorithms with symbolic methods using the SAS
software.


I. Introduction

EDF (the French national electricity company) is testing a new type of
control room with a computer-based interface.

Ergonomists have to evaluate this new control room. During trials,
operators who run a simulated nuclear power plant are observed by
ergonomists. The latter note down the operators' behaviour and key in the
principal actions (from the ergonomic point of view): moving, grouping,
speaking, etc.

The evaluation of the control room requires a global view of the
operators' working methods:

1. Activity of each operator

2. Activity of the team


II. Methods and algorithms

1) Knowledge representation

The heterogeneity of the data (operators have different tasks, described
with different variables) and the themes studied (for example the notion of
"activity" in a team) have led us to consider the problem of knowledge
representation and computing in the broad framework of Symbolic Data
Analysis [Diday 93]. The activity of a team can be defined using the
mathematical form of synthetic objects:


Activity1 : [duration = [0h25, 1h10[ ]
          ∧ [glance(operator1) = {0.9 screen, 0.1 synoptic}]
          ∧ [glance(operator2) = {0.8 screen, 0.2 elsewhere}]

2) Methods

The analyses proposed for the themes studied have been designed using both
classical numerical methods and symbolic methods. Such an idea is already
used in the generalization of symbolic objects coming from machine
learning [Summa 93].

3) Algorithms

In order to handle complex data structures, we couple a classical
statistical package (SAS) with symbolic methods. To analyse the different
activities, we are developing a statistical toolbox.


The symbolic methods developed in the toolbox let us use:

- variables taking several levels at the same time
  (uncertainty on values)

- links between variables
  (for example: if "groupment" = no then "with who" has no sense)

The toolbox can be decomposed into two parts:

Symbolic methods
    symbolic histograms
    symbolic hierarchical clustering
    symbolic pyramidal clustering (with constrained base)
    symbolic explanation of clusters

Transformation methods
    dissimilarity computing
    probability computing

In addition, the toolbox lets us use all the classical methods already
available in the SAS System.

III. Application to the S3C project


The benefit of the coupling is the ease of switching between classical
methods and symbolic methods.

A good example of this possibility is the search for a planar
representation of the different observations and the search for the
operators' activity trajectories. These activities are described by each
operator's grouping with the other operators in the control room.

The initial data describe the grouping of the operator during regular time
intervals (one minute).

The methods used are a sequence of classical methods and symbolic methods.

To find a planar representation:

1) Symbolic dissimilarity computing to transform the data into numerical
form

then 2) Classical multidimensional scaling to represent the data in a plane
then 3) Classical K-means clustering
then 4) Symbolic explanation of the clusters.
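
Step 1 of the sequence above can be illustrated with a small Python sketch. The text does not say which dissimilarity the toolbox uses, so the measure below (half the L1 distance between two modal descriptions) is purely an assumption, as are the function names; the two descriptions reuse the glance values quoted for the synthetic object in Section II.

```python
def modal_dissimilarity(desc_a, desc_b):
    """Half the L1 distance between two weighted (modal) descriptions.

    Each description maps a category to its weight; missing categories
    count as weight 0. This choice of measure is an illustrative assumption.
    """
    categories = set(desc_a) | set(desc_b)
    return 0.5 * sum(abs(desc_a.get(c, 0.0) - desc_b.get(c, 0.0))
                     for c in categories)


def dissimilarity_matrix(descriptions):
    """Numerical table that a classical method such as MDS could take as input."""
    return [[modal_dissimilarity(a, b) for b in descriptions]
            for a in descriptions]


glance_1 = {"screen": 0.9, "synoptic": 0.1}
glance_2 = {"screen": 0.8, "elsewhere": 0.2}
matrix = dissimilarity_matrix([glance_1, glance_2])
```

The resulting matrix is symmetric with a zero diagonal, which is the form expected by classical multidimensional scaling in step 2.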


To find the operators' activity trajectories:

1) Symbolic computing of probabilities to transform the data into a
numerical table

2) Classical factorial analysis on the numerical table

3) Computation of the trajectories as supplementary individuals

4) Symbolic explanation of the axes


IV. Conclusion


Combining classical data analysis methods with symbolic data analysis
methods makes it possible to handle complex data sets (uncertainty on
values, links between variables) while keeping the possibility of using
classical data analysis.

The results may be encouraging (the real data have not been received yet),
but developing symbolic methods within a classical statistical package such
as the SAS System is very cumbersome (the basic structure of a SAS data set
is a table of rows and columns).

Coupling an object-oriented database with a classical statistical package
using a language like C++ seems more appropriate.

Symbolic Data Analysis


The mushroom data set


• 1st feature:


Individuals are described by logical expressions

Cepe : [HatColor = {red, green}]
     ∧ [FootHeight = {small}]


The description can be complex

Boletus : [HatColor = {yellow, brown}]
        ∧ [Height = [0, 7] if HatColor = yellow,
                    [7, 15] if HatColor = brown]

and can use external knowledge

if HatPresence = no then HatColor has no sense
if HatColor = black then Smell is nauseous or pleasant


HC = {r, g} : during the day, the hat colour has been either red or green.


SDA (continued)


• 2nd feature:


Individuals are described in the same syntax

boletus : [HatColor = {white}]
        ∧ [Height = [1, 3]]

meadow mushrooms : [HatColor = {white}]
                 ∧ [Height = [1, 3]]


Links between variables

Strong link : the reverse link exists

HatPresence = absent => HatColor and HatShape have no sense

HatColor has no sense
or HatShape has no sense => HatPresence = absent


Simple link : no reverse link

SDA (continued)

• 3rd feature:

Several ways to describe individuals or concepts

Intension

meadow mushrooms : [HatColor = {white}]
                 ∧ [Height = [1, 3]]

Extension

among the set of initial individuals Ω

ext_Ω(meadow) : {chanterelle, agaric}

among the set of possible descriptions Θ

ext_Θ(cepe) = {(circle, red), (circle, green), (square, red),
(square, green)}
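
The extension of a description among a set of possible descriptions can be sketched in Python as below. The encoding of a symbolic object as a mapping from variable names to sets of admissible values, the variable names `HatShape` and `HatColor`, and the domains (including the extra colour white) are assumptions made for the illustration.

```python
from itertools import product


def extension(symbolic_object, domains):
    """Descriptions (tuples ordered as in `domains`) that satisfy the object.

    symbolic_object maps a variable to its admissible values; a variable
    it does not mention is unconstrained.
    """
    names = list(domains)
    result = []
    for values in product(*(domains[name] for name in names)):
        candidate = dict(zip(names, values))
        if all(candidate[var] in allowed
               for var, allowed in symbolic_object.items()):
            result.append(values)
    return result


# Hypothetical description space: two hat shapes and three hat colours.
domains = {"HatShape": ["circle", "square"],
           "HatColor": ["red", "green", "white"]}
cepe = {"HatColor": {"red", "green"}}
ext_cepe = extension(cepe, domains)
```

With these assumed domains, `ext_cepe` contains the four (shape, colour) pairs listed above for the cepe.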


Study of the mushroom evolution

• Trajectories of the green house evolutions on the first factorial plane (PCA)
» transformation of the data into a pseudo-disjunctive form
» principal component analysis
» trajectory drawings


[Figure: the green house evolution trajectories]

Transformation of the data

... and having a description compatible with the individual mushroom


Disjunction of the data

Computation by individual and by variable


Transformation of the data (continued)


[Table: mushroom descriptions, not using links and using links]

Symbolic interpretation of the factorial axes


•  The  axes  (a = 0.7)


CXJUn  INTBR 
«X«la  0.142t« 
Mwlp  1.00000 
•JM3n  0.20571 
Ajulp  0.00000 


pncxm  DI0C1U1I 
0.24110  1.00000 
0.03440  1.00000 
0.24130  0.70000 
0.55172  0.72727 


RICODV  VARl 
1.0000  hp 
1.0000  ho 
1.0000  fh 
1.0000  bp 


MODI  VAU  HOD! 

a 

g  n 

■ 

P 


•  The  positive  extremity  of  the  second  axis  (a = 0.7,  percentage
computing)

CLASS  INTER  PERCENT  DISCRIM  RECOUV  VAR1  MOD1  VAR2  MOD2
UMla  0.14205  0.24130  1.00000  1.0000  hp  A 

UMlp  1.00000  0.03440  0.77770  1.0000  ha  ■ 

UM2a  0.20571  0.24130  0.00000  1.0000  fh  a 

axa2p  0.00000  0.55172  0.72532  0.0125  a»  p 

•X«2p  0.00000  0.55172  1.00000  0.1250  ha  t 

axa2V  0.00000  0.55172  1.00000  0.0625  ha  e 


Principal Component Analysis


Active individuals


Variance explained : 61 % (40 + 21)

Principal trend : hat absence/presence

Second axis : small foot / medium-height foot + pleasant smell


Conclusions and Prospects

• Classical + Symbolic Data Analysis Methods

• Supporting complex structures

» uncertainty on variable values, links between variables

• Keeping the power of classical methods

• Strategy of analysis

• Mushroom evolution

» combination of methods (symbolic + classical)

• Prospects

• Application on real data sets

• Interface with an Object Oriented Data Base

• Exploration of other features of SDA

» Modal objects
» Hordes
» ...

Abstract

Kolassa and Tanner (1994) present the Gibbs-Skovgaard algorithm for
approximate conditional inference. This algorithm makes use of the double
saddlepoint approximation to the conditional distribution function of a
sufficient statistic given the remaining sufficient statistics. This
approximation is used with the Gibbs sampler to generate a Markov chain.
The equilibrium distribution of this chain approximates the joint
distribution of the sufficient statistics associated with the parameters
of interest conditional on the observed values of the sufficient
statistics associated with the nuisance parameters. In this paper recent
extensions to this methodology are recounted, and open questions related
to the existence and accuracy of the resulting approximation to the
desired distribution are discussed.

1.  Introduction 

Kolassa and Tanner (1994) construct an algorithm for simulating
observations from distributions approximating null conditional
distributions in generalized linear models, in order to construct
conditional significance tests. They suggest using the Gibbs sampler to
construct a Markov chain whose equilibrium distribution is the conditional
distribution of interest, and approximating this chain by sampling from
the double saddlepoint conditional cumulative distribution function
approximation of Skovgaard (1987) instead of from the true conditional
cumulative distribution functions. This approximation depends on a
parameter $m$ roughly measuring the number of independent and identically
distributed observations represented in the data set. Besag and Clifford
(1989, 1991) discuss methods by which such a Markov chain may be used for
frequentist inference. This paper surveys work extending that of Kolassa
and Tanner (1994) in a number of ways. Theorems are cited governing
irreducibility and ergodicity of the constructed Markov chain. Accuracy of
the resulting equilibrium distribution as an approximation to the desired
distribution, the use of a higher-order approximation of Kolassa (1992a),
and the extension of Kolassa (1992b) to cases in which the saddlepoint is
not defined, are all discussed.

* Supported by grant CA 63050 from the National Institutes of Health

Tierney (1991) reviews Markov chain convergence results in the more
general case in which Hilbert space techniques are inapplicable; this
paper makes use of such methods. These methods are similar to those used
by Roberts and Polson (1994), but are extended to the case where the
sampling performed is only approximately according to the Gibbs scheme,
using the double saddlepoint approximation. Roberts and Smith (1994)
discuss conditions of aperiodicity and irreducibility necessary for
convergence. These questions are considered in this paper.

This paper is organized as follows. First, Markov chain terminology, Gibbs
sampling, and the double saddlepoint distribution function approximation
are reviewed. An example of the method of Kolassa and Tanner (1994) is
recounted. Irreducibility of their Markov chain is considered in the
setting of regression problems. Results are recounted showing geometric
convergence to an equilibrium distribution, dependent on $m$. A simple
example is given demonstrating that stronger convergence is not in general
possible. The equilibrium distribution is conjectured to converge to the
target distribution as $m$ increases.

2.  Markov  Chain  Terminology 

The  methods  used  in  this  paper  to  prove  convergence 
of  the  constructed  Markov  chains  are  similar  to  those 
used  by  other  authors.  To  make  connections  between 
this  work  and  other  Markov  chain  literature  clearer, 
some  common  definitions  concerning  Markov  chains  are 
introduced.  The  first  defines  the  structure  of  transitions 
from  one  step  in  the  chain  to  another.  The  second 
considers  whether  the  state  space  may  be  divided  into 
two  spaces,  between  which  the  chain  never  travels. 
If  this  is  the  case,  there  are  an  infinite  number  of 
equilibrium  distributions  for  the  chain,  depending  on 
how  much  mass  is  initially  allocated  to  each  subspace. 
The  third  definition  concerns  whether  the  measure 
induced  by  certain  transitions  in  the  chain  can  be 
bounded below by a measure that does not depend on
the initial transition. Nummelin (1984) presents many
Markov chain convergence results depending on this
property.

Definition 2.1: Suppose transitions in a Markov chain with state space
$\mathfrak{T}$ from state $y$ are given by the density $P(y, t)$ with
respect to a measure $\mu$ and Borel sets $\mathcal{T}$ relative to the
relevant topology on $\mathfrak{T}$. Let $P(y, \cdot)$ be the associated
measure on $\mathcal{T}$, and recursively define the measures
$P^{(1)}(y, \cdot) = P(y, \cdot)$, and

$$P^{(n)}(y, A) = \int_{\mathfrak{T}} P^{(n-1)}(t, A)\, P(y, dt).$$

Definition 2.2: For any measure $\mu$ on $\mathcal{T}$, the chain is
$\mu$-irreducible if $P[\exists n \ni T_n \in A \mid T_0 = t] > 0$ for all
$t \in \mathfrak{T}$ and for all $A \in \mathcal{T} \ni \mu(A) > 0$.
Equivalently, the chain is irreducible if
$\mu(\{y \mid \sum_{m=0}^{\infty} P^{(m)}(t, y) = 0\}) = 0\ \forall t \in \mathfrak{T}$.

Definition 2.3: A set $C \subset \mathfrak{T}$ is small if and only if
there exist a constant $a > 0$ and a probability measure $\nu$ on
$\mathcal{T}$ such that $P(t, \cdot) \ge a\,\nu(\cdot)$ for $t \in C$.

Nummelin (1984) describes several forms of convergence results for Markov
chains. Two are considered here:

Definition 2.4: A Markov chain is uniformly ergodic if there exist a
probability measure $\pi$, a constant $r \in (0, 1)$, and a constant $M$
such that $\|P^{(n)}(y, \cdot) - \pi(\cdot)\|_{TV} < M r^n$, and is
geometrically ergodic if there exist a probability measure $\pi$, a
constant $r \in (0, 1)$, and a function $M(y)$ such that
$\|P^{(n)}(y, \cdot) - \pi(\cdot)\|_{TV} < M(y) r^n$. Here
$\|\cdot\|_{TV}$ is the total variation norm on the space of finite
measures on $\mathcal{T}$.

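Definition 2.5: A Markov chain has period $q$ if there exist disjoint
measurable subsets $\mathfrak{T}_0, \ldots, \mathfrak{T}_{q-1}$ of
$\mathfrak{T}$ such that $P[T_{n+1} \in \mathfrak{T}_j \mid T_n = t] = 0$
whenever $t \in \mathfrak{T}_i$ and $i \ne (j - 1) \bmod q$, and if $q$ is
the largest integer having this property.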
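
To make the ergodicity definitions concrete, the following Python sketch checks a geometric bound of the form in Definition 2.4 for a two-state chain. The transition matrix, the constants `M = 1.5` and `r = 0.6`, and the function names are all assumptions chosen for the illustration; the total variation norm is taken here as the sum of absolute differences.

```python
def n_step_distribution(P, start, n):
    """Row of the n-step transition matrix P^(n) for the given start state."""
    size = len(P)
    row = [1.0 if j == start else 0.0 for j in range(size)]
    for _ in range(n):
        row = [sum(row[i] * P[i][j] for i in range(size)) for j in range(size)]
    return row


def total_variation(mu, nu):
    """Total variation norm of mu - nu, taken as the sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(mu, nu))


# Hypothetical two-state chain; its stationary distribution is (0.75, 0.25)
# and its second eigenvalue is 1 - 0.1 - 0.3 = 0.6.
P = [[0.9, 0.1],
     [0.3, 0.7]]
pi = [0.75, 0.25]
M, r = 1.5, 0.6

# Largest slack in the bound ||P^(n)(y, .) - pi|| <= M r^n over both start states.
worst_gap = max(total_variation(n_step_distribution(P, y, n), pi) - M * r ** n
                for y in (0, 1) for n in range(0, 30))
```

Because one constant `M` works for both starting states, this small chain satisfies a bound of the uniform type; a bound with `M` depending on the starting state would correspond to geometric ergodicity.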

3.  Gibbs  Sampling 

The  Gibbs  sampler  is  a  popular  Markov  chain  method 
useful  for  yielding  a  sample  from  a  posterior  or  likelihood 
density.  It  was  first  introduced  by  Geman  and  Geman 
(1984)  in  the  context  of  image  reconstruction.  The  data 
augmentation  algorithm  of  Tanner  and  Wong  (1987), 
introduced  as  a  device  for  the  calculation  of  posterior 
distributions,  is  a  ‘two-component’  version  of  the  Gibbs 
sampler.  See  Tanner  (1993)  for  background  details  and 
important  references. 

Let the symbol $p(\cdots \mid \cdots)$ denote the distribution of those
random variables listed before the vertical line conditional on those
listed after, and let the vector $T_{-j}$ denote the vector $T$ with
component $j$ deleted. To obtain a sample from the joint conditional
distribution $p(T_1, \ldots, T_a \mid T_{a+1}, \ldots, T_d)$, the
systematic scan Gibbs sampler iterates the following loop: Sample

1) $T_1^{(n+1)}$ from $p(T_1 \mid T_2^{(n)}, \ldots, T_a^{(n)}, T_{a+1}, \ldots, T_d)$.

2) $T_2^{(n+1)}$ from $p(T_2 \mid T_1^{(n+1)}, T_3^{(n)}, \ldots, T_a^{(n)}, T_{a+1}, \ldots, T_d)$.

...

a) $T_a^{(n+1)}$ from $p(T_a \mid T_1^{(n+1)}, \ldots, T_{a-1}^{(n+1)}, T_{a+1}, \ldots, T_d)$.

If the algorithm converges, for a sufficiently large value of $n$ we can
take $(T_1^{(n)}, \ldots, T_a^{(n)})$ as a simulated observation from the
equilibrium distribution $p(T_1, \ldots, T_a \mid T_{a+1}, \ldots, T_d)$
of the Markov chain. Independently replicating this Markov chain $l$ times
produces an independent and identically distributed sample of size $l$
from the distribution of interest.
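
The loop above, and the idea of replicating the chain independently, can be sketched in Python for a case where the full conditionals are known exactly. The target below (a standard bivariate normal with correlation `rho`) is an assumption chosen only because its conditionals are simple; it is not the contingency-table setting of this paper, and the function names are illustrative.

```python
import math
import random


def gibbs_chain(rho, n_iter, rng):
    """Run one systematic scan Gibbs chain and return its final state.

    For a standard bivariate normal with correlation rho, each full
    conditional is N(rho * other, 1 - rho^2).
    """
    t1, t2 = 0.0, 0.0
    sd = math.sqrt(1.0 - rho * rho)
    for _ in range(n_iter):
        t1 = rng.gauss(rho * t2, sd)   # step 1: update T_1 given current T_2
        t2 = rng.gauss(rho * t1, sd)   # step 2: update T_2 given new T_1
    return t1, t2


def replicate(rho, n_iter, n_chains, seed=0):
    """Independent replications; the final states form an i.i.d. sample."""
    rng = random.Random(seed)
    return [gibbs_chain(rho, n_iter, rng) for _ in range(n_chains)]


sample = replicate(rho=0.8, n_iter=50, n_chains=2000, seed=1)
mean_1 = sum(s[0] for s in sample) / len(sample)
mean_2 = sum(s[1] for s in sample) / len(sample)
cov_12 = sum((s[0] - mean_1) * (s[1] - mean_2) for s in sample) / len(sample)
var_1 = sum((s[0] - mean_1) ** 2 for s in sample) / len(sample)
var_2 = sum((s[1] - mean_2) ** 2 for s in sample) / len(sample)
corr = cov_12 / math.sqrt(var_1 * var_2)
```

Keeping only the final state of each replication trades computing time for independence between the sampled observations, as in the scheme described above.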

4.  Double  Saddlepoint  Approximation 

Often the one-dimensional marginal distributions required for Gibbs
sampling are unavailable. Kolassa and Tanner (1994) suggest instead
sampling from double saddlepoint approximations to the appropriate
conditional cumulative distribution functions. The double saddlepoint
cumulative distribution function approximation of Skovgaard (1987)
generalizes the secant approximation due to Lugannani and Rice (1980)
examined by Skates (1993). Suppose a vector $T$ arises as the mean of $m$
independent and identically distributed random vectors, each with cumulant
generating function $K$, and one wishes to approximate the distribution of
$T_u$ conditional on the value of
$T_{-u} = (T_1, \ldots, T_{u-1}, T_{u+1}, \ldots, T_d)$. In the context
above this will be applied for $u \le a$. The double saddlepoint
approximation involves solving the multivariate saddlepoint equations both
for the full distribution of $T$ and for the distribution of the shorter
random vector $T_{-u}$. The approximate conditional distribution function
is

$$\hat F(t^u \mid t^{-u}) = \Phi(\sqrt{m}\,\hat w) + \phi(\sqrt{m}\,\hat w)\left[\frac{1}{\hat w} - \frac{1}{\hat z}\right]\Big/\sqrt{m}, \qquad (1)$$

where

$$\hat w = \mathrm{sign}(\hat\beta^u)\sqrt{2\left[(\hat\beta^j - \tilde\beta^j)K^j(\hat\beta) - K(\hat\beta) + K(\tilde\beta)\right]}, \qquad \hat z = \hat\beta^u\sqrt{\det K''(\hat\beta)\big/\det K''_{-u}(\tilde\beta)}, \qquad (2)$$

and $\hat\beta$ and $\tilde\beta$ solve

$$K^j(\hat\beta) = t^j\ \forall j \quad\text{and}\quad K^j(\tilde\beta) = t^j\ \forall j \ne u,\ \tilde\beta^u = 0. \qquad (3)$$

Here superscripts on $K$ denote partial derivatives, repeated indices are
summed, $K''_{-u}$ is the $(d-1)\times(d-1)$ submatrix of the second
derivative matrix of $K$ corresponding to all components of $\beta$ and
$T$ except component $u$, and $\Phi$ and $\phi$ are the normal
distribution function and density respectively. The vectors $\hat\beta$
and $\tilde\beta$ represent maximum likelihood estimators for the
canonical exponential family containing $T$.

Also of interest are inversion techniques for lattice distributions.
Skovgaard (1987) derives a counterpart of (1) in the lattice case, in
which $\hat\beta^u$ is replaced by $2\sinh(\tfrac12\hat\beta^u)$ in the
definition of $\hat z$, and in which $t^u$ is corrected for continuity
when calculating $\hat\beta$. That is, if possible values for $T_u$ are
$\Delta$ units apart, $\hat\beta$ solves $K^j(\hat\beta) = \tilde t^j$,
where $\tilde t^j = t^j$ if $j \ne u$ and $\tilde t^u$ is $t^u$ offset by
$\tfrac12\Delta$.

Later results concerning using the double saddlepoint approximation in
conjunction with Gibbs sampling will require that the approximation be
monotone. The derivative of (1) is

$$\sqrt{m}\,\frac{\hat w}{\hat z}\,\phi(\sqrt{m}\,\hat w)\left(1 + \frac{(d\hat z/d\hat w)\,\hat z^{-1}\hat w^{-1} - \hat z\,\hat w^{-3}}{m}\right)\frac{d\hat w}{dt^u}; \qquad (4)$$

this constitutes an approximate conditional density for $T_u$ in the
continuous case, and in the lattice case probability atoms are integrals
of (4). Then (1) is a distribution function for

$$m > \max\left(\hat z\,\hat w^{-3} - (d\hat z/d\hat w)\,\hat z^{-1}\hat w^{-1}\right). \qquad (5)$$

Skates (1993) discusses monotonicity of similar distribution function
approximations.

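We evaluate the monotonicity criterion (5). Let $\kappa_{ij}$ and
$\kappa_{ijk}$ denote the second and third derivatives of $K$ at
$\hat\beta$, and let $\kappa^{ij}$ denote the inverse of the matrix
$\kappa_{ij}$. Since $K^i(\hat\beta) = t^i$, we have
$d\hat\beta^i/dt^u = \kappa^{ui}$, while $\tilde\beta$ does not depend on
$t^u$. Then
$d\hat z/dt^u = \hat z\,(\kappa^{uu}/\hat\beta^u + \tfrac12\kappa_{ijk}\kappa^{jk}\kappa^{ui})$
and
$d\hat w/dt^u = (\hat\beta^u - \tilde\beta^u)/\hat w = \hat\beta^u/\hat w$,
since $\tilde\beta^u = 0$. Hence
$d\hat z/d\hat w = \hat z\,\hat w\,(\kappa^{uu}/\hat\beta^u + \tfrac12\kappa_{ijk}\kappa^{jk}\kappa^{ui})/\hat\beta^u$.
Furthermore, the density associated with (1) can be expressed as

$$\sqrt{m}\,\phi(\sqrt{m}\,\hat w)\,\frac{\hat\beta^u}{\hat z} \times \left(1 + \frac{(d\hat z/d\hat w)\,\hat z^{-1}\hat w^{-1} - \hat z\,\hat w^{-3}}{m}\right).$$

Terms of order $O(\sqrt{m})$ comprise the double saddlepoint density
approximation of Barndorff-Nielsen and Cox (1979). For further discussion
and references see Kolassa (1994a).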


5.  An  Example 

Kolassa  and  Tanner  (1994)  apply  Gibbs-Skovgaard 
approach  to  higher- way  contingency  tables.  Consider  the 
distribution  of  elements  in  c?i  xc?2  xda  contingency  tables, 
expressed  as  Xijk  where  i  €  {1,  *  *  * ,  di},  i  E  {1,  •  •  ‘,^2}, 
and  k  G  {l,-**,d3},  conditional  on  one  dimensional 
marginals.  Express  the  table  in  terms  of  did2d3  sufficient 
statistics,  of  which  1  is  the  overall  total,  di  ~  1  are  first 
unidimensional  totals,  ^2  1  are  second  unidimensional 

marginal  totals,  and  da  -  1  are  third  unidimensional 
marginal  totals,  (di  —  l)(d2  -  1)  are  first  bidimensional 
totals,  (di  —  l)(d3  —  1)  are  second  bidimensional  totals, 
(d2  —  l)(d3  —  1)  are  third  bidimensional  totals,  and  the 
remaining  (di  —  l)(d2  —  l)(d3  —  1)  sufficient  statistics 
are  the  entries  with  none  of  their  indices  at  the 
highest  values.  The  first  di  -h  d2  +  da  -  2  sufficient 
statistics  are  ancillary  to  the  null  hypothesis  of  complete 
independence  nested  within  the  saturated  model  for 
Poisson  means.  Other  sufficient  statistics  are  ancillary 
and  are  conditioned  on.  Consider  the  following  data 
describing  the  presence  or  absence  of  torus  mandibularis 
among  male  and  female  Inuits  aged  41-50  in  three 
different  groups,  collected  by  Muller  and  Mayhall 
(1971),  and  cited  by  Bishop,  Fienberg,  and  Holland 
(1975): 

Sex  Igloolik  Group  Hall  Beach  Group  Aleut  Group 
Pres.  Abs.  Pres.  Abs.  Pres.  Abs. 

M  10  0  4  2  4  5 

F  6  4  4  0  2  2 

These  data  were  chosen  to  assess  the  quality  of  the 
Markov  chain  algorithm  in  a  situation  in  which  usual 
asymptotic  approximations  may  be  inappropriate.  The 
hypothesis  of  independence  was  tested  by  generating 
random  tables  using  the  Gibbs  sampler  and  Skovgaard’s 
approximation  as  described  above.  Statistics  for 
Pearson^s  Test  and  the  Likelihood  Ratio  Test  were 
calculated  for  each  simulated  table. 

We  simulated  5,000  independent  Markov  chains  for 
200  iterations.  For  each  integer  n  between  1  and  200 
we  estimated  the  p-value  after  iteration  n  generated  by 
each  test  statistic  by  calculating  the  test  statistic  for 
the  observed  table,  and  for  each  of  the  simulated  tables 
represented  by  the  state  of  the  chain  at  time  n.  We 
report  as  the  estimated  p-value  the  proportion  of  sample 
tables  with  a  test  statistic  value  as  high  or  higher 
than  the  observed  value.  Convergence  was  assessed  by 
observing how these estimates change with n; in this
example  after  fewer  than  25  iterations  the  estimated 
p-values  become  stable.  The  asymptotic  p-values  based 
on  the  Pearson  and  likelihood  ratio  statistics  are  0.004 
and  0.001,  respectively.  The  corresponding  values  for 
the  Gibbs-Skovgaard  algorithm  are  0.136  and  0.063. 
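As a minimal sketch of the two test statistics named above, the following computes the Pearson and likelihood ratio statistics for the observed table under complete independence. It assumes the usual expected counts (total times the product of the three one-way marginal proportions) and the convention that empty cells contribute zero to the likelihood ratio statistic; the variable names are illustrative only, and no attempt is made to reproduce the p-values reported in the text.

```python
import math

# counts[sex][group][status]; status 0 = present, 1 = absent
counts = [
    [[10, 0], [4, 2], [4, 5]],   # male: Igloolik, Hall Beach, Aleut
    [[6, 4], [4, 0], [2, 2]],    # female
]
d1, d2, d3 = 2, 3, 2

n = sum(c for sex in counts for grp in sex for c in grp)
sex_tot = [sum(sum(grp) for grp in counts[i]) for i in range(d1)]
grp_tot = [sum(sum(counts[i][j]) for i in range(d1)) for j in range(d2)]
sta_tot = [sum(counts[i][j][k] for i in range(d1) for j in range(d2))
           for k in range(d3)]

pearson = 0.0
lik_ratio = 0.0
for i in range(d1):
    for j in range(d2):
        for k in range(d3):
            obs = counts[i][j][k]
            # expected count under complete independence
            exp = sex_tot[i] * grp_tot[j] * sta_tot[k] / n ** 2
            pearson += (obs - exp) ** 2 / exp
            if obs > 0:                      # 0 * log(0) is taken as 0
                lik_ratio += 2.0 * obs * math.log(obs / exp)

# cells minus the overall total and the unidimensional marginal totals
df = d1 * d2 * d3 - 1 - (d1 - 1) - (d2 - 1) - (d3 - 1)
print(n, df, pearson, lik_ratio)
```

The same two functions of a table would be applied to every simulated table, and the estimated p-value is the proportion of simulated values at least as large as the observed one.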

6.  Irreducibility  of  Chains  for  Regression  Models 

Kolassa  (1994b)  considers  certain  regression  models, 
and  determines  when  the  Gibbs  sampling  Markov  chain 
applied  to  the  sufficient  statistics 

$$T = Z^{\top} Y, \quad \text{with } Y \in \prod_j \mathfrak{Y}_j \qquad (6)$$

is irreducible. The continuous case result is straightforward:

Theorem 6.1 : For the statistics $T$ of (6), where
each $\mathfrak{Y}_j$ is a connected open subset of $\mathbb{R}$, if a
formal Gibbs sampling scheme, when sampling
component $u$ conditional on the remaining components, samples from a
distribution on $\{t_u \mid t \in \mathfrak{T},\ t_{-u} = T_{-u}\}$ having a positive
density with respect to Lebesgue measure, then the
chain is irreducible.

The following example shows that the discrete case
is more delicate. Suppose that $\mathfrak{Y}_u$ is a subset of
the integers from 1 to $M - 1$ for each $u$. Let $Z$
be the $d \times m$ matrix with $(1, M, M^2, \ldots, M^{d-1})$ and
$(1, M + 1, (M + 1)^2, \ldots, (M + 1)^{d-1})$ as two columns,
and the rest arbitrary. Conditioning on the sufficient
statistic associated with either of these two columns in
effect conditions on all of the $Y_i$, since the $Y$ can be
reconstructed from each of these sufficient statistics by
themselves. Gibbs sampling in this situation will fail.
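The reconstruction argument can be illustrated with a short sketch. It assumes the column entries are successive powers of the base, so that the sufficient statistic is the base-$M$ (or base-$(M+1)$) number whose digits are the $Y_i$; the function names `encode` and `decode` are hypothetical and introduced only for this illustration.

```python
def encode(y, base):
    """Sufficient statistic sum_i y[i] * base**i for a power column."""
    return sum(v * base ** i for i, v in enumerate(y))

def decode(t, base, length):
    """Recover the counts from the statistic by reading off base-`base` digits."""
    digits = []
    for _ in range(length):
        t, digit = divmod(t, base)
        digits.append(digit)
    return digits

M = 5
y = [3, 1, 4, 2]              # each entry lies between 1 and M - 1
t_first = encode(y, M)        # statistic for the column of powers of M
t_second = encode(y, M + 1)   # statistic for the column of powers of M + 1
print(decode(t_first, M, len(y)), decode(t_second, M + 1, len(y)))
```

Because either statistic alone determines every $Y_i$, the conditional sample space given that statistic is a single point, and a component-wise sampler cannot move.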

Consider  a  second  logistic  regression  example,  in 
which  the  first  and  last  components  of  the  sufficient 
statistic  are  conditioned  on: 
1  0  0  0
1  1  1  0

No  series  of  rearrangements  of  the  indicators  in  v,  each 
keeping  the  first,  last,  and  second  or  third  components 
of  s  fixed,  will  draw  s  closer  to  t. 

One might be tempted to try to extend the above
argument to discrete distributions on T that will hold
asymptotically as the density of discrete points of X
increases. However, such applications usually involve
positive probability on the boundary points of the sets
$\mathfrak{Y}_j$, which may leave vertices of T forming subsets of the
state space not communicating with other state space
points.

Instead,  combinatoric  arguments  examining  rear¬ 
rangements  of  the  counts  in  y  are  necessary  to  prove 
the  following  theorem: 

Theorem 6.2 : For the statistics $T$ of (6), where

1. each $\mathfrak{Y}_j$ is the intersection of a connected subset of $\mathbb{R}$
and $\mathbb{Z}$, each with at least two elements.

2.  The  matrix  Z  has  a  column  of  ones  as  its  first  column, 
and  consists  entirely  of  zeros  and  ones. 

3.  There  exists  a  path  through  the  corresponding  rows  of 
Z,  where  two  rows  are  connected  if  they  are  identical 
except  for  one  entry. 

4.  None  of  the  first  d  components  of  the  sufficient 
statistic  are  at  their  extreme  values. 

Then the Gibbs sampling Markov chain
associated with conditioning on the first d entries of T
is irreducible.

Corollary  6.3  :  The  result  of  Theorem  6.2  holds  if 
condition  3  is  replaced  by 

3. For each row z with a non-fixed unit entry, say in
column u, there exists a row w identical to z in Z,
except that w has a zero in column u, and these pairs
exhaust Z.

7.  Convergence  of  the  Markov  Chain 

Kolassa (1994b) demonstrates convergence of the
Markov chain constructed by Kolassa and Tanner
(1994) by showing that certain sets are small in the
sense of Definition 2.3, and by using convergence criteria
given by Nummelin (1984) to demonstrate the existence
of an equilibrium distribution. To apply this definition
a dominating distribution must be found. This
dominating distribution is expressed in terms of $W$, by
embedding the Markov chain in a larger sample
space, allowing the quantities $W$ to be calculated from
sample points, to obtain:

Theorem  7.1  :  If  {dz/dwi)z“^Wi^  —  zw^^  is  uni¬ 
formly  bounded  above  by  a  constant  ei  less  than  unity, 
and  if  it  is  uniformly  bounded  below,  then  the  Markov 
chain is geometrically ergodic.

Schervish and Carlin (1992) discuss strategies for
demonstrating that a Markov chain is geometrically
ergodic, and Tierney (1991) discusses strategies for
demonstrating both kinds of ergodicity. The following
counterexample shows that in a very simple example
of Gibbs sampling, the resulting chain is not uniformly
ergodic, implying that Theorem 7.1 is as strong as
one can expect. Tierney (1991) notes that uniform
ergodicity holds if and only if the entire sample space is
small, and I show that this is not the case.

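Suppose $T$ is bivariate normal, parameterized as
$N(0, \Sigma)$, and hence $K(\beta) = \beta^{\top}\Sigma\beta/2$. Express $\Sigma$ as

$$\Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}.$$

Applying the standard Gibbs sampler to $T$, the distribution
of $T^{(n)}$ conditional on $T^{(n-1)}$ is multivariate normal,
$E[T^{(n)} \mid T^{(n-1)}] = \rho\,(1, \rho)^{\top}\, T_2^{(n-1)}$,
and $\mathrm{Var}[T^{(n)} \mid T^{(n-1)}]$ is a positive definite matrix
not dependent on $T^{(n-1)}$. Hence for any bounded
measurable set $A \subset \mathbb{R}^2$, $P[T^{(n)} \in A \mid T^{(n-1)} = t] \to 0$
as $|t_2| \to \infty$, and hence the entire sample space
is not small.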
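The counterexample can be checked numerically with a short sketch. It assumes unit variances and correlation $\rho$, and assumes that one sweep updates the first component given the old second component and then the second component given the new first one; the function names and the choice of the square $[-1, 1]^2$ as the bounded set are illustrative only.

```python
import math
import random

def gibbs_sweep(t, rho, rng):
    """One sweep of the standard Gibbs sampler for a standardized bivariate normal."""
    sd = math.sqrt(1.0 - rho * rho)
    t1 = rng.gauss(rho * t[1], sd)   # first component given the old second one
    t2 = rng.gauss(rho * t1, sd)     # second component given the new first one
    return (t1, t2)

def conditional_mean(t, rho):
    """Mean of the state after one sweep, given the current state t."""
    return (rho * t[1], rho * rho * t[1])

def hit_frequency(start, rho, sweeps, rng, half_width=1.0):
    """Fraction of single sweeps from `start` that land in the square A."""
    hits = 0
    for _ in range(sweeps):
        t1, t2 = gibbs_sweep(start, rho, rng)
        if abs(t1) <= half_width and abs(t2) <= half_width:
            hits += 1
    return hits / sweeps

rng = random.Random(0)
near = hit_frequency((0.0, 0.0), 0.9, 2000, rng)
far = hit_frequency((0.0, 100.0), 0.9, 2000, rng)
print(near, far)
```

Starting far from the origin, the one-step mean is proportional to the starting value while the spread stays fixed, so the chance of reaching a fixed bounded set in one sweep vanishes; no single minorizing measure can therefore serve all starting points.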

8.  Accuracy  of  Equilibrium  Distributions 

Schervish and Carlin (1992) define a space $\mathcal{H} = L^2(\mu/p)$
consisting of the functions $f$ on $\mathfrak{T}$, measurable with
respect to $\mathcal{T}$, such that $\|f\|_{\mathcal{H}}^2 = \int f(y)^2/p(y)\,\mu(dy) < \infty$,
where $p$ is the density of the equilibrium distribution.
Then $\mathcal{H}$ is a Hilbert space (Rudin, 1976), and $\mathcal{H} \ni p$.
For a transition density $P$ formed from Gibbs sampling,
let $S$ be the operator $g \mapsto \int P(t, y)\,g(t)\,\mu(dt)$. This
operator maps an unconditional distribution on the
state space of the Markov chain to the distribution after
one iteration. For every $g \in L^2(\mu/p)$,

$$\int g(y)\,dy = \int (Sg)(y)\,dy, \qquad \int g(y)\,dy = \int (S^*g)(y)\,dy, \qquad (7)$$

where $S^*$ is the adjoint operator to $S$. These facts are
used to show that when $S$ is restricted to the set of
functions in $\mathcal{H}$ integrating to unity, the norm of the
resulting operator is strictly less than one. Liu, Wong,
and Kong (1994) perform similar calculations for the
space $L^2(\mu p)$. The Banach space fixed-point theorem
can then be used to show that the chain converges
geometrically.

In the absence of knowledge that the transition probabilities
of the approximate Gibbs sampler correspond
to conditional distributions from some unknown joint
distribution, the conditional distribution $p$ used to define
$\mathcal{H}$ is undefined, and the above construction fails. If,
alternatively, another appropriate norming distribution
is substituted, the second equality in (7) fails. Hence
techniques of §7 were necessary to prove the existence
of the equilibrium density $p_m$.

Once existence of an equilibrium density is demonstrated,
Hilbert space techniques might be used to show
that the equilibrium distribution $p_m$ approximates $p$ to
the proper order. This work is currently in progress.

9.  Further  Work 

Kolassa  (1992a)  presents  the  following  higher-order 
double  saddlepoint  approximation  to  conditional  tail 
probabilities: 

Theorem  9.1  :  The  second-order  saddlepoint  ap¬ 
proximation  to  the  conditional  cumulative  distribution 
function  to  0(m“®/^)  is, 


1  —  F(x^\x^, . . . ,  x^)  =  1  —  ^{y/mwi)  (j>(y/mwi) 

1 _ ]_  1 

(y/mwi)^  y/mwi  ^/mz 

+  —  [5(^4  -  Pi)  -  5(pi3  -  ha)  -  n(^23  -  ha) 

_  1  \ 

'  01 

where  f  is  given  by  (2).  The  invariants  are  given  by 
Pi3  =  P23  =  and 

^4  =  P4,  Pi3,  and  are  the  corresponding 

quantities  calculated  from  and  k^^^K 

Kolassa (1992b) considers application of these methods
in logistic regression problems when a solution to
(3) does not exist. Consider a model for counts of
binary outcomes $Y$: For each $i \in \{1, \ldots, M\}$ let $Y_i$ be
the number of successes in $N_i$ Bernoulli trials, each with
success probability $\pi_i = \exp(\eta_i)/(1 + \exp(\eta_i))$, where
$\eta_i = z_i\beta$. The quantities $z_i \in \mathbb{R}^d$ are row vectors of
covariates. Let $Y$ and $N$ be the vectors of the number
of successes $Y_i$ and the number of binary trials $N_i$, each
with $M$ components. Sufficient statistics $T$ are given
by (6). For some $Y$ one or more components of the
saddlepoint may be infinite. This is likely to happen
when the binary outcomes associated with a certain
covariate vector $z_i$ are all successes or all failures.
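The quantities in this model can be sketched in a few lines. The example assumes that the sufficient statistics are the inner products of the columns of the covariate matrix with the success counts, and it simply flags covariate rows whose trials are all successes or all failures; it is not the diagnostic of Albert and Anderson (1984), and the function names and toy data are hypothetical.

```python
import math

def success_prob(z_row, beta):
    """pi_i = exp(eta_i) / (1 + exp(eta_i)) with eta_i = z_i beta."""
    eta = sum(z * b for z, b in zip(z_row, beta))
    return math.exp(eta) / (1.0 + math.exp(eta))

def sufficient_stats(Z, Y):
    """One statistic per covariate column: sum_i Z[i][j] * Y[i]."""
    return [sum(Z[i][j] * Y[i] for i in range(len(Y))) for j in range(len(Z[0]))]

def boundary_rows(Y, N):
    """Indices whose trials are all failures (Y_i = 0) or all successes (Y_i = N_i)."""
    return [i for i, (y, n) in enumerate(zip(Y, N)) if y == 0 or y == n]

Z = [[1, 0], [1, 1], [1, 2]]   # toy covariate rows with an intercept column
Y = [0, 2, 3]                  # successes
N = [3, 3, 3]                  # trials
print(sufficient_stats(Z, Y), boundary_rows(Y, N), success_prob(Z[0], [0.0, 0.0]))
```

Rows flagged by `boundary_rows` are the ones for which fitted probabilities of 0 or 1, and hence infinite saddlepoint components, are most likely to arise.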

Albert  and  Anderson  (1984)  provide  a  diagnostic  for 
whether  all  saddlepoint  components  are  finite.  Clarkson 
and  Jennrich  (1991)  determine  which  covariate  vectors 
are  associated  with  fitted  probabilities  of  0  or  1.  Kolassa 
(1992b)  extends  (1)  to  this  case: 
Theorem 9.2 : Consider the logistic regression problem
above, with sufficient statistics given by (6). Let
$W$ be $Z$ with the column corresponding to covariate $j$
removed. Choose $s$ and $r$ in $[0, 1]^M$. Maximize the sum
of all components of $s$ and $r$ subject to

-  N)s  -  =  o, 

$$(I - W(W^{\top}W)^{-1}W^{\top})(s - r) = 0.$$

Let $s^*$ and $r^*$ be the maximizing values of these vectors.
Classify all observation indices $j$ into one of three
mutually exclusive and exhaustive sets as follows: An
index $j$ is assigned to $\mathcal{D}_a$ if the maximum above occurs
when $r_j = 1$. An index $j$ is assigned to $\mathcal{D}_b$ if the
maximum above occurs when $s_j = 1$. The remaining
indices, for which the maximum above
occurs when $s_j = r_j = 0$, are assigned to $\mathcal{D}_c$. Let $W_1$ be the matrix of rows
oiW  whose  indices  are  in  Va  UPb,  multiplied  by  —1  if 
for  rows  corresponding  to  indices  in  and  let  W2  be 
the  matrix  of  the  remaining  rows  of  W  whose  indices  are 
in  Vc-  Let  V  be  the  matrix  formed  by  inserting  as  row 
j  of  Wf  a  vector  of  zeros.  Prepend  as  the  first  column 
of  V  a  vector  of  zeros  except  with  1  in  coordinate  j. 
Let  be  the  result  of  performing  Gram-Schmidt 

orthonormalization  on  the  columns  of  V,  Let  U  be  any 
matrix  with  d  —  rank{V)  orthogonal  columns  such  that 
U^U'^  =  o;  17  may  be  constructed  by  completing  the 
Gram-Schmidt  process.  Solve  =  o  subject 

to  up  =  o,  using  Newton  iterations  of  the  form 


/3  -  )3o  =  t) 


The  vector  is  obtained  similarly.  Let  Ey(^)  ~ 
Ylj  —  v(^))zi  +  U^U,  where  the  summation 

is  over  j  G  "Dc  ■  Then 


P{Tj  <  =  t-j)  =  ^{y/mw)  —  <j)(y/mw)x 


a 

•1» 

\ 

c 

f  m 

\ 


1 


+  0(m-3/2). 
Abstract

In this paper we propose two simple one-step methods
for approximating distribution quantiles of statistics.
The methods are derived by approximately inverting
saddlepoint formulas for the cumulative distribution
functions of the sample mean. The resulting formulas
are suitable for use with a pocket calculator. Their
extensions to conditional inference and finite population
problems are also discussed.

Key words and phrases: Conditional distribution,
Cornish-Fisher approximation, Finite population, Quantiles,
Saddlepoint methods.

1.  INTRODUCTION 

The problem of approximating the distribution of a
statistic is an important one in statistical theory and in
practice. The normal, Edgeworth and saddlepoint approximations
are the three most commonly used methods.
The normal approximation is very simple, but often
inaccurate, especially for small sample sizes. The Edgeworth
expansions are slightly more complicated, but theoretically
more appealing. While they usually improve
over the normal approximations, their numerical accuracy
is still often questionable. Even worse, they have
some undesirable properties, such as negative tail probabilities.

Saddlepoint methods, on the other hand, generally
provide accurate approximations whenever they are applicable.
They have played an increasingly important
role in statistics since their introduction by
Daniels' (1954) pioneering paper, especially during the
last decade. See Daniels (1987), Reid (1988) and Field &
Ronchetti (1990) for general reviews of the background
and development of saddlepoint methods.
In many statistical problems we do not directly use
the distribution of a certain statistic. Instead, we need
the quantiles of the distribution, which can be obtained
by inverting the distribution function. The inversion of
the normal approximation is easily accomplished, even
on a pocket calculator. Formulas for the inverse of the
Edgeworth approximations are available. They are often
called Cornish-Fisher approximations; see Withers
(1984) for a recent development in this area.

One major disadvantage of saddlepoint approximations
has been that their analytic inversions are not available
and no quick and easy numerical procedures have been
developed for practitioners, possibly armed only with a
calculator, although lengthier procedures do exist, e.g.,
the method of bisection search. This, together with the
fact that the saddlepoint expansions are based on slightly
more mathematical analysis, makes these accurate approximation
methods less appealing to practitioners.

Attempting to remedy this drawback, in this note we
present two simple one-step methods to compute approximate
saddlepoint expansions for quantiles. Using
the results of the interesting work of Jensen (1992), two
Newton-Raphson type numerical procedures are derived
in Section 2 that are easily implemented on a pocket
calculator. We show that our one-step methods are generally
accurate enough for most applications. Iterative
adjustments with quick convergence to the true saddlepoint
quantiles are suggested. Section 3 describes an
extension of the one-step methods for quantiles of a conditional
distribution. Two examples are considered in
Section 4 to demonstrate their numerical performance.

Note that the two proposed methods in this paper
are intended to make the saddlepoint approximations more
convenient and easier to use, and thus more attractive to
practitioners. When computers are conveniently
available, it is more natural to use in a program the
well-known methods of bisection and Newton-Raphson
to calculate quantiles, although the two new methods developed
here are still serious competitors. For example,
the second method given in Section 2.2 does not even
need to solve (2) for the saddlepoint. This is different
from all the existing approaches. Hesterberg (1994) gives
alternative methods for obtaining saddlepoint quantile,
distribution function, and inverse distribution function
estimates.

2. TWO ONE-STEP SADDLEPOINT
APPROXIMATIONS FOR QUANTILES

In  this  section  we  consider  two  one-step  saddlepoint 
methods  in  the  following  subsections. 

2.1.  The  first  method 

Assume that $X_1, \ldots, X_n$ are independent and identically
distributed with the moment generating function
existing in a non-trivial open interval containing the origin.
Further assume that we are interested in approximating
quantiles of the distribution of statistic $T$. In
particular, we consider here $T = \bar{X}$, the sample mean.
More general cases will be discussed in later sections.

Let $\mu = E(X_1)$ and $K(t) = \ln M(t)$ be the cumulant
generating function of $X_1 - \mu$. The Lugannani & Rice
(1980) saddlepoint formula for approximating $F_n(x) =
\Pr(\bar{X} - \mu \le x)$ is defined as

$$G_n(x) = \Phi(w_x) + \phi(w_x)\left(\frac{1}{w_x} - \frac{1}{z_x}\right), \qquad (1)$$

where

$$w_x = \mathrm{sgn}(t_x)\,[2n\{t_x x - K(t_x)\}]^{1/2}, \qquad z_x = t_x\{nK''(t_x)\}^{1/2},$$

$t_x$ is the solution to

$$K'(t) = x, \qquad (2)$$

and $\Phi$ and $\phi$ are the standard normal cumulative distribution
and density functions, respectively. The relative
error in $G_n(x)$ is of order $O(n^{-1})$ in a large deviation
region and $O(n^{-3/2})$ in the shrinking set of
$\{x : |x - \mu| < c/\sqrt{n}\}$ for any fixed $c > 0$.

Given $\alpha \in (0, 1)$ let $x_\alpha$ be the $\alpha$th quantile of $F_n$, i.e.,
$F_n(x_\alpha) = \alpha$. It is readily shown that the corresponding
saddlepoint quantile $x_{s\alpha} = G_n^{-1}(\alpha)$ satisfies

$$x_{s\alpha} = x_\alpha\{1 + O(n^{-3/2})\} \qquad (3)$$

for any fixed $\alpha$ as $n \to \infty$. There are methods available
to obtain $x_{s\alpha}$ numerically, but they are all hardly workable
on a desk top calculator in general. The goal here
is to derive a simple method to approximate $x_\alpha$ that can
be calculated on a calculator quickly and easily.

First we state an important result of Jensen (1992) as
follows. Let

$$r^*_x = w_x - \frac{1}{w_x}\log\left(\frac{w_x}{z_x}\right). \qquad (4)$$

Then we have

$$\Phi(r^*_x) = G_n(x)\{1 + O(n^{-1})\}, \qquad (5)$$

and the error holds uniformly for $x$ in a compact set.
Furthermore, for any $c > 0$ the error $O(n^{-1})$ can be
replaced by $O(n^{-3/2})$ for $|x - \mu| < c/\sqrt{n}$. This result is
essentially Lemma 2.1 of Jensen (1992) and an elegant
proof of this is given there.
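To make formulas (1), (2), (4) and (5) concrete, the sketch below evaluates them for the mean of gamma variables. It assumes a gamma distribution with shape 0.1 and rate 1 and $n = 10$ (the setting of Example 1 in Section 4), for which the mean is exponential and the exact quantile is available; the closed-form saddlepoint follows from differentiating the centred gamma cumulant generating function, and all function names are illustrative.

```python
import math
from statistics import NormalDist

THETA, BETA, N = 0.1, 1.0, 10   # gamma shape, gamma rate, sample size (assumed setting)
MU = THETA / BETA               # E(X_1)

def saddlepoint(x):
    """Solution t_x of K'(t) = x for the centred gamma cgf; needs x > -MU, x != 0."""
    return BETA - THETA / (x + MU)

def cgf(t):
    """K(t), the cumulant generating function of X_1 - mu."""
    return -THETA * math.log(1.0 - t / BETA) - t * MU

def cgf2(t):
    """K''(t)."""
    return THETA / (BETA - t) ** 2

def w_and_z(x):
    t = saddlepoint(x)
    w = math.copysign(math.sqrt(2.0 * N * (t * x - cgf(t))), t)
    z = t * math.sqrt(N * cgf2(t))
    return w, z

def lugannani_rice(x):
    """G_n(x) of formula (1)."""
    w, z = w_and_z(x)
    nd = NormalDist()
    return nd.cdf(w) + nd.pdf(w) * (1.0 / w - 1.0 / z)

def r_star(x):
    """r*_x of formula (4)."""
    w, z = w_and_z(x)
    return w - math.log(w / z) / w

# exact 0.9 quantile of the sample mean, expressed as a deviation from mu
x = -math.log(1.0 - 0.9) / 10.0 - MU
g_value = lugannani_rice(x)
phi_value = NormalDist().cdf(r_star(x))
print(g_value, phi_value)
```

Both printed values should be close to 0.9 and to each other, which is the content of (5); the formulas are singular at $x = 0$, where a limiting form would be needed.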

The basic idea of this paper is to use the transformation
(4) to obtain a much simpler relationship between
$\alpha$ and an approximate $\alpha$th quantile by (5). A one-step
procedure based on this relationship is described as follows.

Let $x_{J\alpha}$ be the $x$ value such that $r^*_x = \Phi^{-1}(\alpha)$. It is
seen from (5) and (3) that

$$x_{J\alpha} = x_{s\alpha}\{1 + O(n^{-3/2})\} = x_\alpha\{1 + O(n^{-3/2})\}.$$

With a reasonable initial value $x_0$ (see (9), for example),
we want to get a one-step approximation $x_1$ for
$x_{J\alpha}$, and therefore for $x_\alpha$. Let

$$x_1 = x_0 + \Delta x_0, \qquad (6)$$

where $\Delta x_0$ is a small adjustment given in (8). By (4), for
$x = x_{J\alpha}$,

$$(r^*_x)^2 = w_x^2 - 2\ln\left(\frac{w_x}{z_x}\right) + \left\{\frac{1}{w_x}\ln\left(\frac{w_x}{z_x}\right)\right\}^2 = d,$$

where $d = \{\Phi^{-1}(\alpha)\}^2$. Notice that for fixed $\alpha$ the last
two terms of $(r^*_x)^2$ are of order at most $O(n^{-1/2})$. Therefore,
for $x - \mu = O(n^{-1/2})$ and thus $t_x = O(n^{-1/2})$ we
have

$$\frac{\partial}{\partial x}(r^*_x)^2 = \frac{\partial}{\partial x}w_x^2 + O(1) = 2nt_x + O(1).$$

Hence, the expansion of $(r^*_x)^2$ at $x = x_0$ is

$$(r^*_{x_0})^2 + 2nt_{x_0}\Delta x_0 + O(\Delta x_0), \qquad (7)$$

where $t_{x_0}$ is the solution to (2) for $x = x_0$. Setting (7)
to be $d$ we obtain

$$\Delta x_0 = \{d - (r^*_{x_0})^2\}/(2nt_{x_0}). \qquad (8)$$

Let $q_\alpha = \Phi^{-1}(\alpha)\sigma/\sqrt{n}$ and $t_{q_\alpha}$ be the corresponding solution
to (2) for $x = q_\alpha$ (when $q_\alpha$ is not in the domain of
$x$, a suitable substitute is needed), where $\sigma^2 = \mathrm{Var}(X_1)$.
We suggest using the following initial value

$$x_0 = q_\alpha + \{d - (r^*_{q_\alpha})^2\}/(2nt_{q_\alpha}). \qquad (9)$$
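The one-step procedure (6), (8) and (9) can be sketched as follows. The sketch assumes the gamma setting of Example 1 in Section 4 (shape 0.1, rate 1, $n = 10$), where the saddlepoint equation has a closed-form solution, and it follows that example in replacing $\mu + q_\alpha$ by 0.01 when it is negative. The function names are hypothetical, and the case $\alpha = 0.5$, where $q_\alpha = 0$, is not handled.

```python
import math
from statistics import NormalDist

THETA, BETA, N = 0.1, 1.0, 10        # gamma shape, gamma rate, sample size (assumed setting)
MU = THETA / BETA                    # E(X_1)
SIGMA = math.sqrt(THETA) / BETA      # standard deviation of X_1

def saddlepoint(x):
    """Closed-form solution t_x of K'(t) = x for the centred gamma cgf."""
    return BETA - THETA / (x + MU)

def r_star(x):
    """r*_x of formula (4), built from w_x and z_x of formula (1)."""
    t = saddlepoint(x)
    k = -THETA * math.log(1.0 - t / BETA) - t * MU      # K(t_x)
    k2 = THETA / (BETA - t) ** 2                        # K''(t_x)
    w = math.copysign(math.sqrt(2.0 * N * (t * x - k)), t)
    z = t * math.sqrt(N * k2)
    return w - math.log(w / z) / w

def newton_type_step(x, d):
    """x + {d - (r*_x)^2} / (2 n t_x), the update used in (8) and (9)."""
    return x + (d - r_star(x) ** 2) / (2.0 * N * saddlepoint(x))

def one_step_quantile(alpha):
    """One-step approximation to the alpha-th quantile of the sample mean."""
    u = NormalDist().inv_cdf(alpha)
    d = u * u
    q = u * SIGMA / math.sqrt(N)
    if MU + q <= 0.0:                # outside the domain: substitute 0.01 for mu + q
        q = 0.01 - MU
    x0 = newton_type_step(q, d)      # initial value (9)
    x1 = newton_type_step(x0, d)     # one-step approximation (6) with (8)
    return MU + x1

print(one_step_quantile(0.9), one_step_quantile(0.1))
```

Each evaluation needs only the cumulant generating function, its second derivative and the saddlepoint at two points, which is what makes the method practical on a calculator.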
When (2) can be solved analytically
in terms of $x$, our one-step saddlepoint approximation
$x_1$ can be calculated easily on a calculator. Note that
in constructing confidence intervals or testing hypotheses
we are mostly interested in the quantiles in the two
tails rather than around the center of the distribution,
so that $\sqrt{n}\,t_{q_\alpha}$ is bounded away from zero. Note that
the first and second terms in (9) are of order $O(n^{-1/2})$
and $O(n^{-1})$, respectively. With this initial value we have
that $\Delta x_0$ in (8) is of order $O(n^{-3/2})$ since

$$d - (r^*_{x_0})^2 = d - \{(r^*_{q_\alpha})^2 + 2nt_{q_\alpha}(x_0 - q_\alpha) + O(n^{-1})\} = O(n^{-1}).$$

Moreover, by expanding $(r^*_{x_1})^2$ at $x = x_0$ (first equation
below) and at $x = x_{J\alpha}$ (second equation below) respectively
as above, we obtain that

$$d + O(\Delta x_0) = (r^*_{x_1})^2 = (r^*_{x_{J\alpha}})^2 + 2nt_{x_{J\alpha}}(x_1 - x_{J\alpha}) + O(x_1 - x_{J\alpha}).$$

From the first equation we have that $d - (r^*_{x_1})^2 = O(\Delta x_0) = O(n^{-3/2})$.
Since $(r^*_{x_{J\alpha}})^2 = d$, it is seen that $2nt_{x_{J\alpha}}(x_1 - x_{J\alpha}) = O(n^{-3/2})$
and thus

$$x_1 = x_{J\alpha}\{1 + O(n^{-3/2})\} = x_\alpha\{1 + O(n^{-3/2})\}. \qquad (10)$$

Our experience shows that the one-step approximation
is numerically accurate enough for most applications. Its
numerical accuracy is illustrated in Section 4.

In the case where more accuracy is desirable, we could
repeat the one-step method by using the previous approximation
$x_{i-1}$ as the initial value of the current step:

$$x_i = x_{i-1} + \Delta x_{i-1}, \quad i = 2, 3, \ldots, \qquad (11)$$

where $\Delta x_{i-1} = \{d - (r^*_{x_{i-1}})^2\}/(2nt_{x_{i-1}})$.

The series $\{x_i\}$ usually converges quickly to $x_{J\alpha}$ as $i$
increases. When $i = 2$, it is seen that $(r^*_{x_2})^2 = (r^*_{x_1})^2 +
2nt_{x_1}\Delta x_1 + O(\Delta x_1) = d + O(n^{-2})$. Using the same
expansion and by induction we have

$$d - (r^*_{x_i})^2 = O(n^{-(2+i)/2})$$

for $i = 2, 3, \ldots$. Thus, the argument leading to (10) may
be used again so that

$$d + O(\Delta x_{i-1}) = (r^*_{x_i})^2 = (r^*_{x_{J\alpha}})^2 + 2nt_{x_{J\alpha}}(x_i - x_{J\alpha}) + O(x_i - x_{J\alpha}),$$

which implies that

$$x_i = x_{J\alpha}\{1 + O(n^{-i/2-1})\}. \qquad (12)$$

The fast convergence is evident in the examples in Section 4.
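Repeating the adjustment as in (11) can be sketched in the same way. The sketch again assumes the gamma setting of Example 1 in Section 4 (shape 0.1, rate 1, $n = 10$) and an upper-tail level for which $q_\alpha$ lies in the domain; the fixed number of repetitions and the function names are choices made here for illustration, not part of the method as described.

```python
import math
from statistics import NormalDist

THETA, BETA, N = 0.1, 1.0, 10        # gamma shape, gamma rate, sample size (assumed setting)
MU = THETA / BETA
SIGMA = math.sqrt(THETA) / BETA

def saddlepoint(x):
    """Closed-form solution t_x of K'(t) = x for the centred gamma cgf."""
    return BETA - THETA / (x + MU)

def r_star(x):
    """r*_x of formula (4)."""
    t = saddlepoint(x)
    k = -THETA * math.log(1.0 - t / BETA) - t * MU
    k2 = THETA / (BETA - t) ** 2
    w = math.copysign(math.sqrt(2.0 * N * (t * x - k)), t)
    z = t * math.sqrt(N * k2)
    return w - math.log(w / z) / w

def iterate(alpha, steps):
    """Return x_0, x_1, ..., x_steps from (9), (6) and (11) as deviations from mu."""
    u = NormalDist().inv_cdf(alpha)
    d = u * u
    x = u * SIGMA / math.sqrt(N)     # q_alpha, assumed to lie in the domain
    path = []
    for _ in range(steps + 1):
        x = x + (d - r_star(x) ** 2) / (2.0 * N * saddlepoint(x))
        path.append(x)
    return path

alpha = 0.9
path = iterate(alpha, 50)
limit = MU + path[-1]
print(MU + path[1], MU + path[2], limit)
```

At the limit the relation $r^*_x = \Phi^{-1}(\alpha)$ holds to rounding error, so the iterates approach the quantile $x_{J\alpha}$ rather than the exact quantile $x_\alpha$.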

2.2. The second method

The solution to (2) is often not in a simple analytic
form and cannot be easily obtained numerically. In such
cases, the one-step formula (6) may not be very convenient
and thus we suggest a slightly different version as follows.

Let $t_{J\alpha}$ be the solution to (2) when $x = x_{J\alpha}$.
Instead of approximating $x_{J\alpha}$ directly as in (6) and (11)
we could approximate $t_{J\alpha}$ first and then get the corresponding
approximation for $x_{J\alpha}$. Since $r^*_x$ in (4) can also
be viewed as a function of $t$, we rewrite it as $r^*(t)$. For
X  =  we  now  define  =  qa/<T^  (when  it  is  not  in  the 
domain  of  t,  use  a  reasonable  substitute).  The  following 
initial  value  for  tja  is  suggested 

A small adjustment $\Delta t_0$ corresponding to $\Delta x_0$ in (8) is
obtained as

$$\Delta t_0 = [d - \{r^*(t_0)\}^2]/\{2nt_0K''(t_0)\}.$$

Thus the one-step approximation for $t_{J\alpha}$ is given by

$$t_1 = t_0 + \Delta t_0, \qquad (13)$$

and the corresponding one-step approximation for $x_{J\alpha}$
(and thus for $x_\alpha$) is $K'(t_1)$. Computational savings may
be substantial when this method is used in complicated cases
where more computational effort is needed to calculate
$K(t)$ and its derivatives.

Similarly, further steps $\{t_i\}$ for $t_{J\alpha}$ and $\{K'(t_i)\}$ for
$x_\alpha$ can be obtained by using (13) with initial value $t_{i-1}$.
It can be shown that $K'(t_i)$ and $x_i$ have the same order
of the relative error for $x_{J\alpha}$, i.e.,

$$K'(t_i) = x_{J\alpha}\{1 + O(n^{-i/2-1})\}$$

for $i = 1, 2, \ldots$. The proof is similar to that leading
to (10) and (12), and is thus omitted here. The second
example in Section 4 demonstrates the use of this second
approach, among other things.

3. SADDLEPOINT QUANTILES OF
CONDITIONAL DISTRIBUTIONS

The simple idea considered in Section 2 can be extended
to other saddlepoint approximations. To be more
specific, however, in this section we concentrate on the
case of approximating conditional distributions explored
by Skovgaard (1987). The results of that paper are of
special importance and are widely used in various problems,
but as in other saddlepoint methods there has been
no quick way to obtain the corresponding quantiles.
Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be independent and identically
distributed continuous random $p$-dimensional vectors
with the cumulant generating function $K(u, v)$ existing
in a neighborhood of the origin, where $Y_i$ and $v$
are $(p - 1)$-dimensional. Let

$$K_a(u, v) = \frac{\partial}{\partial a}K(u, v), \qquad K_{ab}(u, v) = \frac{\partial^2}{\partial a\,\partial b}K(u, v)$$

for $a$, $b$ being either $u$ or $v$. Denote the sample mean
of $(X_i, Y_i)$ by $(\bar{X}, \bar{Y})$. Skovgaard (1987) derived a
saddlepoint expansion for the conditional distribution
$P(\bar{X} - \mu_y \le x \mid \bar{Y} = y)$ in the same form as in (1),

$$G_n(x \mid y) = \Phi(w_{x|y}) + \phi(w_{x|y})\left(\frac{1}{w_{x|y}} - \frac{1}{z_{x|y}}\right), \qquad (14)$$

where

$$w_{x|y} = \mathrm{sgn}(u_x)\left(2n\left[u_x(\mu_y + x) + v_x^{\top}y - K(u_x, v_x) - \{v_0^{\top}y - K(0, v_0)\}\right]\right)^{1/2},$$

$$z_{x|y} = u_x\left\{n\,|K''(u_x, v_x)|\,/\,|K_{vv}(0, v_0)|\right\}^{1/2},$$

$v_0$ and $(u_x, v_x)$ are the solutions to the saddlepoint equations

$$K_v(0, v_0) = y, \quad v_0 \in \mathbb{R}^{p-1},$$

and

$$K_u(u_x, v_x) = \mu_y + x, \quad K_v(u_x, v_x) = y, \quad (u_x, v_x) \in \mathbb{R}^p,$$

respectively. Moreover, $\mu_y = K_u(0, v_0)$, and $|A|$ denotes
the determinant of matrix $A$.

Letting

$$r^*_{x|y} = w_{x|y} - \frac{1}{w_{x|y}}\log\left(\frac{w_{x|y}}{z_{x|y}}\right) \qquad (15)$$

and using Lemma 2.1 of Jensen (1992), one can show
that (5) is true for our new $r^*_{x|y}$ and $G_n(x \mid y)$. To approximate
the $\alpha$th quantile $x_{\alpha|y}$ of $P(\bar{X} - \mu_y \le x \mid \bar{Y} = y)$
we first define the corresponding saddlepoint quantile as
$x_{J\alpha|y}$ such that $\Phi(r^*_{x|y}) = \alpha$ at $x = x_{J\alpha|y}$.

We can now construct a one-step approximation $x_{1|y}$
as in Section 2. Let the initial value be

$$x_{0|y} = q_\alpha + \{d - (r^*_{q_\alpha|y})^2\}/(2nu_{q_\alpha}).$$

Since $\frac{\partial}{\partial x}(w_{x|y})^2 = 2nu_x$, we have the adjustment analogous
to (8):

$$\Delta x_{0|y} = \{d - (r^*_{x_{0|y}|y})^2\}/(2nu_{x_{0|y}}).$$

The one-step approximation

$$x_{1|y} = x_{0|y} + \Delta x_{0|y} \qquad (16)$$

is accurate up to order $O(n^{-3/2})$ as before.

As in Section 2, further steps $x_{i|y}$ ($i = 2, 3, \ldots$), when
needed, are easily defined as well as computed to improve
the accuracy until the convergence to the limit
$x_{J\alpha|y}$. The same relative error rate of $O(n^{-i/2-1})$ also
holds. Furthermore, the second one-step method (13)
can be readily extended analogously to the conditional
distribution problem. We omit the details.

4.  EXAMPLES 

In this section we consider two examples to illustrate
how our one-step methods are quickly implemented and
the numerical accuracy they obtain.

Example 1. Suppose that $X_1, \ldots, X_n$ is a sample
drawn from the gamma distribution $G(\theta, \beta)$, i.e.,

$$f(x) = \frac{\beta^{\theta}}{\Gamma(\theta)}x^{\theta-1}e^{-\beta x}, \quad x > 0,$$

and $T = \frac{1}{n}\sum_{i=1}^{n}X_i$. It is computationally convenient to
make comparisons in this example, since the exact distribution
of $T$ is known to be $G(n\theta, n\beta)$. The cumulant
generating function of $X_1 - \mu$ is

$$K(t) = -\theta\log(1 - t/\beta) - \theta t/\beta,$$

with the solution to (2) $t_x = \beta - \theta/(x + \theta/\beta)$.

To illustrate the performance of the one-step method
defined in (6), (8) and (9) we take $\theta = 0.1$, $\beta = 1$, a
somewhat extreme case. Furthermore, let the sample
size $n$ be as small as 10. This choice is also convenient
since $\bar{X}$ here has an exponential distribution with mean
0.1, so that the $\alpha$th quantile of $\bar{X}$ is in fact explicitly
given by $x_\alpha = -\log(1 - \alpha)/10$. On the other hand,
this particular case is representative of the whole gamma
family for the comparisons of several methods.

In Table 1, the one-step approximation is compared
with other approximations. Note that when $\mu + q_\alpha < 0$ we
replaced it with 0.01. It is seen that $x_1$ is close enough
to the saddlepoint approximation $x_{J\alpha}$ and thus to the exact $x_\alpha$
for most practical purposes. In some cases, a few more
steps may be desirable and they are easily implemented.

alpha    exact     one-step   2nd step   exact SA   normal     two-term
                              x_2                              Cornish-Fisher
0.005    0.00050   0.00036    0.00046    0.00048    -0.1576    0.0329
0.01     0.00101   0.00098    0.00098    0.00098    -0.1326    0.0247
0.05     0.00513   0.00489    0.00502    0.00504    -0.0645    0.0120
0.1      0.0105    0.0104     0.0104     0.0104     -0.0282    0.0123
0.2      0.0223    0.0212     0.0219     0.0222      0.0158    0.0208
0.8      0.1609    0.1674     0.1589     0.1618      0.1842    0.1597
0.9      0.2303    0.2312     0.2315     0.2315      0.2282    0.2305
0.95     0.2996    0.2995     0.3014     0.3011      0.2645    0.3017
0.99     0.4605    0.4600     0.4630     0.4626      0.3326    0.4694
0.995    0.5298    0.5294     0.5325     0.5322      0.3576    0.5428

Table 1: Approximations to quantiles of $T = \bar{X}$ in Gamma(0.1, 1) case; n = 10

Example 2. This example is to demonstrate that one-step
methods are also useful in finite population problems.
Let $N$ be the population size and $M_1(t)$ be the moment
generating function. A random sample $X_1, \ldots, X_n$
($n < N$) is drawn from the population. We wish to approximate
the quantiles of $\bar{X} - \mu$. Notice that the distribution
of $\bar{X}$ is generally significantly different from
that of the mean of a sample drawn with replacement.
The bootstrap is based on the latter sampling scheme.
See Davison & Hinkley (1988) for an interesting account
of saddlepoint approaches in this area where our one-step
methods are applicable as in the infinite population
case.

The one-step methods described in Section 2 may be
applied to our current problem with some modifications.
Let $K_n(t)$ be the cumulant generating function of $\bar{X} - \mu$.
Then $K_n(t)$ can be expressed in terms of $M_1(jt/n)$
($j = 1, 2, \ldots, n$) and computed recursively. Moreover,
if we use $R_n(t) = \frac{1}{n}K_n(nt)$ to replace $K(t)$ in Section
2, then with a negligible error due to discreteness, the
saddlepoint approximation is still valid as $n, N \to \infty$
and $n/N < d < 1$ for some constant $d$. These results
have been given in Wang (1993).

We now compare the one-step approximation obtained
from (13) with the exact quantiles and other approximations.
To carry out the comparisons, the following population
with $N = 36$ was simulated from an exponential
distribution:

 4.295   34.636   11.204    3.041    3.694    0.570
31.024   11.245    8.591   12.745    6.568    3.913
18.615    6.332    6.841    0.905   13.610   14.981
10.103    2.210    0.765    5.056    7.038    1.849
 1.594   18.450    1.591    6.656   22.752   12.753
 0.790    5.005    7.418   11.321    5.631    3.834

Table 2: Population for the simulation in Example 2

It is obtained that $\mu = 8.8230$. Table 3 lists approximations
to quantiles of $\bar{X} - \mu$ with $n = 5$. It is seen
that the one-step method provides good approximations
for the quantiles. Explicit formulas for Cornish-Fisher
type approximations in the finite population case are
not available. Thus, they are not given in the table.


    α        'exact'   one-step         2nd step   exact SA      normal
                       $K'(t_\alpha)$              $x_\alpha$
    0.005    -6.253    -6.126           -6.283     -6.296        -8.551
    0.01     -5.898    -5.810           -5.935     -5.945        -7.722
    0.05     -4.766    -4.707           -4.781     -4.788        -5.460
    0.1      -4.020    -3.958           -4.027     -4.034        -4.254
    0.2      -2.981    -2.879           -2.945     -2.957        -2.794
    0.8       2.863     2.908            2.841      2.840         2.794
    0.9       4.630     4.735            4.628      4.623         4.254
    0.95      6.113     6.242            6.123      6.123         5.460
    0.99      8.961     8.996            8.933      8.929         7.722
    0.995     9.940     9.976            9.933      9.929         8.551

Table 3: Approximations to quantiles of $T = \bar X - \mu$ in
Example 2 with $n = 5$. 'Exact' distribution based on
100,000 simulated samples
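
As a rough cross-check of the 'exact' and 'normal' columns of Table 3, the following sketch enumerates all C(36, 5) = 376,992 samples drawn without replacement from the population of Table 2, instead of simulating 100,000 of them. The function names are illustrative and not from the paper. The normal approximation assumes the variance $\sigma^2 (1 - n/N)/n$ with $\sigma^2$ computed with divisor $N$; this choice is an assumption, made because it appears to reproduce the printed 'normal' column.

```python
from itertools import combinations
from math import ceil, sqrt
from statistics import NormalDist, fmean, pvariance

# Population of N = 36 values from Table 2.
POPULATION = [
    4.295, 34.636, 11.204, 3.041, 3.694, 0.570,
    31.024, 11.245, 8.591, 12.745, 6.568, 3.913,
    18.615, 6.332, 6.841, 0.905, 13.610, 14.981,
    10.103, 2.210, 0.765, 5.056, 7.038, 1.849,
    1.594, 18.450, 1.591, 6.656, 22.752, 12.753,
    0.790, 5.005, 7.418, 11.321, 5.631, 3.834,
]


def exact_quantiles(population, n, alphas):
    """Quantiles of T = (sample mean - population mean) over all size-n subsets."""
    mu = fmean(population)
    t_values = sorted(sum(s) / n - mu for s in combinations(population, n))
    m = len(t_values)
    # alpha-quantile taken as the smallest t with P(T <= t) >= alpha.
    return {a: t_values[ceil(a * m) - 1] for a in alphas}


def normal_quantiles(population, n, alphas):
    """Normal approximation with variance sigma^2 (1 - n/N) / n (assumed form)."""
    big_n = len(population)
    sd = sqrt(pvariance(population) * (1 - n / big_n) / n)
    return {a: NormalDist().inv_cdf(a) * sd for a in alphas}
```

Calling `exact_quantiles(POPULATION, 5, [0.05, 0.95])` gives values that can be set beside the 'exact' column, keeping in mind that the tabulated figures carry Monte Carlo error.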

5.  Concluding  Remarks 

In this note we have proposed two simple one-step saddlepoint
methods for distribution and conditional distribution
quantiles. We have also discussed their applications
in finite population problems, which are common in
survey sampling and other areas. The one-step approximations
are easily computed on a pocket calculator once
the cumulant generating function is available. Again,
we stress that in presenting the new methods here, their
simplicity has been the main objective. Making good
methods easy to use is an important step toward
attracting the interest of practitioners in using them.

The new methods were developed in the cases where
sample means are the statistics under consideration.
However, since Jensen's (1992) original theoretical results
that we have applied here are valid for many other
more complicated problems, the one-step methods can
be obtained similarly in those cases. Finally we note
that our experience reveals that while the second method
(13) is often more convenient, because it does not require
solving (2), it is generally slightly less accurate than the
first method (6). This is because extra approximations
are involved in the second method. Therefore we recommend
using the first method whenever the solution to
(2) is easily calculable.

We consider robust and resistant estimation of the
covariance matrix of multivariate data. Using expansions
for the influence of individual observations on the
eigenvalues and eigenvectors of the covariance matrix
derived by Critchley (1985), we develop an Eigenstructure
Influence (ESI) method for estimating the covariance
matrix. Motivated by lower rank approximations and
graphical representations of covariance matrices, we
extend the influence measures developed by Critchley to
measure the influence individual observations may have
on these approximations. We develop a downweighted,
iterative estimation algorithm estimating the covariance
matrix directly using the ESI influence measure. We
illustrate the technique with sample data sets and lower
rank biplot graphics.

KEY  WORDS:  Eigenvalues;  Eigenvectors;  Influence 
functions 


1.  INTRODUCTION 

We  consider  a  robust  and  resistant  procedure  to  estimate 
the  covariance  matrix  of  multivariate  data  based  upon  the 
eigenstructure  influence  functions.  The  motivation  behind 
the  use  of  influence  functions  is  that  they  measure  the 
amount of change in parameter estimates at a point $x \in \mathbb{R}^p$.
Outlying points $x$ do not necessarily unduly influence
parameter  estimates,  while  conversely,  non-outlying 
points  with  large  influence  may  change  parameter 
estimates  by  nature  of  their  orientation.  Distributional 
properties  of  the  sample  influence  functions,  graphical 
methods  for  the  identification  of  influential  points,  using 


influence  of  both  the  eigenvalues  and  the  eigenvectors, 
and  specific  application  to  biplots  and  other  lower  rank 
applications  are  discussed  by  Vetter  (1992).  The 
procedure will be referred to as the Eigenstructure
Influence (ESI) procedure. Its performance is compared
with  other  methods  available  in  the  literature. 


2.  MOTIVATION 

Exploratory  and  robust/resistant  techniques  are  becoming 
a  more  widely  accepted  component  of  statistical  practice. 
Estimation  based  upon  a  sample  of  points  is  often 
enhanced  by  the  use  of  robust/resistant  methods.  These 
methods  attempt  to  ameliorate  the  problem  of  distortion 
of  the  underlying  structure  of  the  main  body  of 
observations  by  a  small  number  of  observations.  In 
addressing  this  problem  there  have  been  many  creative 
and  innovative  ideas  presented  to  handle  what  is 
frequently  called  contamination.  Contamination  can  take 
many  forms  but  basically  can  be  defined  as  any  point  or 
set  of  points  which  unduly  influence  the  outcome  of  an 
analysis  or  investigation.  Specifically,  in  the  estimation  of 
dispersion  matrices  for  a  set  of  multivariate  vector 
observations,  there  have  been  numerous  schemes  to 
ameliorate  the  effects  of  outlying  points,  roundoff  errors 
and  other  forms  of  distortion  which  can  be  present  in  any 
set  of  observed  data.  Many  of  these  schemes  involve  the 
use  of  Mahalanobis  distance  which  measures  the  elliptical 
distance  from  a  multivariate  centroid.  Jolliffe  (1986)  has 
suggested  that  the  influence  functions  on  both  the 
eigenvectors  and  the  eigenvalues  be  used  instead  of  the 
Mahalanobis  distance.  The  ESI  procedure  provides  a 
systematic  method  of  using  both  these  influence 
functions, as Jolliffe has suggested.

The  classical  estimator  for  dispersion  matrices  is  the 
maximum  likelihood  estimator  (MLE),  which  has  optimal 
properties  when  the  data  observed  are  from  a  multivariate 
normal  distribution  with  a  given  mean  vector  and 
covariance  matrix.  The  maximum  likelihood  estimator  may 
not  possess  these  optimal  properties,  however,  when 
contamination  is  present  in  the  data. 

In any analysis, one must determine whether
contamination is present or not. For multivariate data it
can be extremely difficult to identify the contaminated
portion of data. The problem is particularly acute in small
data sets, where the identification of the population
distribution may be extremely difficult.

3.  MAHALANOBIS  DISTANCE 

The  Mahalanobis  distance,  as  presented  in  the 
following  discussion,  has  been  the  basis  of  most 
multivariate  estimation  procedures  to  date.  The 
Mahalanobis distance quantifies in a scalar measure the
elliptical distance from a vector centroid. Once the
centroid vector is determined, the distance of a point in p
dimensions is measured and is weighted by the inverse
of  the  covariance  matrix.  Thus,  points  which  lie  along  an 
axis  with  a  large  amount  of  variation  may  have  the  same 
elliptical  distance  as  those  points  which  lie  closer  in 
Euclidean  distance  to  the  centroid  but  lie  along  an  axis 
with  less  variation.  This  measure  has  been  proven  to  be 
an  effective  measure  of  outlyingness  and  has  been  used 
very  successfully  in  the  iterative  procedures  to  detect 
outlying  points. 

The ESI procedure is based upon the influence
functions of the eigenvalues and eigenvectors of the
covariance matrix, which measure the rate of change in
the eigenvalues and eigenvectors at a point $x$. It will be
shown that these rates of change, though related to
Mahalanobis distance, provide more information and thus
can be used to improve the estimation of covariance in
the presence of contamination. The Mahalanobis distance
is defined as:

$$d_{m_i} = (x_i - \mu)'\,\Omega^{-1}\,(x_i - \mu)$$

where it is assumed that

$$x_i \sim N_p(\mu, \Omega)$$


and 


To  date,  techniques  have  been  primarily  focused 
on  outlyingness  of  points  and  not  on  the  influence  of 
points  on  particular  parameters.  Since  there  are  many 
forms  of  data  contamination,  the  proposed  procedure  is 
offered  as  an  alternative  approach  which  considers  all 
forms of contamination, i.e. any data which distort the
parameters  of  interest. 
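
The remark that a point far out along a high-variance axis can have the same elliptical distance as a nearer point along a low-variance axis can be illustrated with a small sketch. It evaluates the quadratic form $(x - \mu)'\Omega^{-1}(x - \mu)$ for two-dimensional data using the closed-form inverse of a 2x2 matrix; the function name and the example covariance matrix are illustrative assumptions, not taken from the paper.

```python
def mahalanobis_2d(x, center, cov):
    """Quadratic form (x - c)' cov^{-1} (x - c) for a 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = x[0] - center[0], x[1] - center[1]
    # The inverse of [[a, b], [c, d]] is [[d, -b], [-c, a]] / det.
    return (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det


# Hypothetical covariance: variance 9 along the first axis, 1 along the second.
EXAMPLE_COV = ((9.0, 0.0), (0.0, 1.0))
far_point = mahalanobis_2d((3.0, 0.0), (0.0, 0.0), EXAMPLE_COV)
near_point = mahalanobis_2d((0.0, 1.0), (0.0, 0.0), EXAMPLE_COV)
```

Both points receive the same elliptical distance even though the first is three times farther from the centroid in Euclidean terms.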

2.  THE  ESI  METHOD 

It  has  been  shown  that  the  Mahalanobis  distance 
is  an  increasing  function  of  the  eigenvalue  influence 
functions  of  the  covariance  matrix  (see  Vetter  (1992)). 
The existing methods, which use Mahalanobis distance,
are, in effect, using only the influence on the variation
(eigenvalue), but not the influence on the eigenvector. To
see  why  the  consideration  of  the  influence  of  the 
eigenvector  as  well  as  the  influence  on  the  eigenvalue  is 
important,  consider  the  following  decomposition  of  a 
covariance matrix $S$:

$$S = \sum_{j=1}^{p} \lambda_j \alpha_j \alpha_j'$$

where $\alpha_j$ are the eigenvectors and $\lambda_j$ the eigenvalues of
$S$. It is logical, therefore, to consider the influence of a
point on this particular combination of both the
eigenvalues and the eigenvectors.

The rank $r$ approximation to $S$ is

$$S_r = \sum_{j=1}^{r} \lambda_j \alpha_j \alpha_j'$$

Therefore, the influence on the vector functionals

$$v_j = \sqrt{\lambda_j}\,\alpha_j, \quad j = 1 \text{ to } p$$

can be used to develop a metric which measures the
change in the covariance matrix $S$ or any lower rank
approximation to $S$.

From Critchley (1985), we have the following
perturbed parameters, where $\epsilon_i$ represents the amount of
contamination in the jth eigenvalue and eigenvector at a
point $x_i$:

$$\lambda_j(\epsilon_i) = \lambda_j + \epsilon_i \nu_{ij} + \tfrac{1}{2}\epsilon_i^2 \mu_{ij} + O(\epsilon_i^3)$$

$$\alpha_j(\epsilon_i) = \alpha_j + \epsilon_i \rho_{ij} + \tfrac{1}{2}\epsilon_i^2 \gamma_{ij} + O(\epsilon_i^3)$$

where for each point $i$, the influence function or relative
rate of change in the jth functional $\lambda_j(\epsilon_i)$ is

$$\nu_{ij} = y_{ij}^2 - \lambda_j$$

and the influence function for the functional $\alpha_j(\epsilon_i)$ is

$$\rho_{ij} = -y_{ij} \sum_{k \ne j} y_{ik} (\lambda_k - \lambda_j)^{-1} \alpha_k$$

where

$$y_{ij} = \alpha_j'(x_i - \mu)$$

is the jth principal component score for the ith point.

We have developed the following equation for the
perturbed vector $v_j(\epsilon_i)$:

$$v_j(\epsilon_i) = \sqrt{\lambda_j(\epsilon_i)}\;\alpha_j(\epsilon_i)$$

The empirical influence curve, which measures the
rate of change in $v_j$ at the ith point, can now be
written as the following vector:

$$\left.\frac{\partial v_j(\epsilon_i)}{\partial \epsilon_i}\right|_{\epsilon_i = 0} = \frac{\nu_{ij}}{2\sqrt{\lambda_j}}\,\alpha_j + \sqrt{\lambda_j}\,\rho_{ij}$$

It should be noted from the form of these influence
functions that there are two independent components.
The first component

$$\frac{\nu_{ij}}{2\sqrt{\lambda_j}}\,\alpha_j$$

will be large when a point is outlying in the jth direction.
The second component

$$\sqrt{\lambda_j}\,\rho_{ij}$$

will be large when a point is influential on the direction of
the jth principal component. When $\rho_{ij}$ is large,
the second part may dominate the value of the influence
function. Since $\nu_{ij}$ is independent of $\rho_{ij}$, points which are
not outlying with respect to elliptical distances may yet be
influential for a particular functional.
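
A minimal sketch of the two first-order influence quantities for two-dimensional data follows. It assumes the standard first-order perturbation forms $\nu_{ij} = y_{ij}^2 - \lambda_j$ and $\rho_{ij} = -y_{ij} \sum_{k \ne j} y_{ik} (\lambda_k - \lambda_j)^{-1} \alpha_k$, a covariance matrix with divisor $n$, the sample mean in place of $\mu$, and distinct eigenvalues. The function names are illustrative; the 2x2 eigendecomposition is written in closed form so that no external library is needed.

```python
from math import hypot, sqrt


def covariance_2d(points):
    """Sample mean and covariance (divisor n) of 2-D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    return (mx, my), ((sxx, sxy), (sxy, syy))


def eigen_sym_2d(s):
    """Eigenvalues (descending) and unit eigenvectors of a symmetric 2x2 matrix."""
    (a, b), (_, d) = s
    if b == 0.0:
        pairs = sorted([(a, (1.0, 0.0)), (d, (0.0, 1.0))], key=lambda t: -t[0])
        return [t[0] for t in pairs], [t[1] for t in pairs]
    half_gap = sqrt(((a - d) / 2) ** 2 + b * b)
    lams = [(a + d) / 2 + half_gap, (a + d) / 2 - half_gap]
    vecs = []
    for lam in lams:
        vx, vy = b, lam - a  # solves (S - lam I) v = 0 when b != 0
        norm = hypot(vx, vy)
        vecs.append((vx / norm, vy / norm))
    return lams, vecs


def eigen_influence(points):
    """Return eigenvalues, eigenvectors, nu[i][j] and rho[i][j] (a 2-vector)."""
    mean, s = covariance_2d(points)
    lams, vecs = eigen_sym_2d(s)
    nu, rho = [], []
    for x, y in points:
        dx, dy = x - mean[0], y - mean[1]
        scores = [vx * dx + vy * dy for vx, vy in vecs]  # principal component scores
        nu.append([scores[j] ** 2 - lams[j] for j in range(2)])
        row = []
        for j in range(2):
            k = 1 - j  # the only other component when p = 2
            coef = -scores[j] * scores[k] / (lams[k] - lams[j])
            row.append((coef * vecs[k][0], coef * vecs[k][1]))
        rho.append(row)
    return lams, vecs, nu, rho
```

With this divisor the eigenvalue influences sum to zero over the sample for each component, and each eigenvector influence is orthogonal to the eigenvector it perturbs, which is why the two parts of the influence curve carry separate information.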

In order to use these influence functions in a weighted
estimation  procedure  we  need  to  convert  the  vectors  to 
some  scalar  measure  of  influence.  A  natural  scalar 
metric  for  measuring  change  to  S  or  any  lower  rank 
approximation  to  S  is  the  following: 

dVi 

pi 

where $r$ is the rank of the estimated matrix and $c_j$ are
scaling factors.

A  scaling  factor  which  allows  a  direct  comparison  with  the 
Mahalanobis  distance  is: 


We then have a comparison of the functional form of $d_v$
to the Mahalanobis distance $d_m$ as follows:

$$(x_i - \bar{x})'\,\Omega^{-1}\,(x_i - \bar{x})$$


If a point has a large Mahalanobis distance it will also
have a large $d_v$ score, due to the first part which
measures influence on the eigenvalue. The $d_v$ score also
includes the influence on the direction of the eigenvectors,
thus protecting against influential but not necessarily
outlying points.

4. ESTIMATION PROCEDURE

Any M-type estimation procedure may be used
with weights based upon the $d_v$ metric. For example, an
accepted and frequently used method is to apply a Huber
influence function which gives equal weight to the middle
portion of the $d$ values but gives decreasing weights as
the scores increase beyond a specific cutoff point $k$. Let

$$w_i = 1 \quad \text{if } d_{v_i} \le k$$

$$w_i = k / d_{v_i} \quad \text{if } d_{v_i} > k$$

For an iterative procedure, it is necessary to extend
Critchley's equations to determine the influence of a point
on a weighted estimate of the covariance matrix.
Critchley's equations reflect the influence of a point as the
difference between the covariance matrix with the point
included and with the point deleted. In our iterative
procedure the influence of a point is reflected as the
difference between giving a point weight $w_i$, where $0 < w_i < 1$,
and giving it weight 0.

Let

$$m = \sum_{i=1}^{n} w_i x_i$$

denote a weighted mean (with the weights scaled to sum to one) and

$$\Omega = \sum_{i=1}^{n} w_i (x_i - m)(x_i - m)'$$

denote a weighted covariance matrix. Then the weighted
average with the ith point downweighted is denoted by

$$m_{(i)} = \frac{m - w_i x_i}{1 - w_i}$$

The empirical distribution function after downweighting the
ith point becomes

$$\hat F_{(i)} = \left(1 + \frac{w_i}{1 - w_i}\right)\hat F - \frac{w_i}{1 - w_i}\,\delta_{x_i}$$

We then get the downweighted covariance matrix

$$\Omega_{(i)} = \left(1 + \frac{w_i}{1 - w_i}\right)\left\{\Omega - \frac{w_i}{1 - w_i}\,(x_i - m)(x_i - m)'\right\}$$

Thus, $w_i / (1 - w_i)$ replaces $\epsilon_i$ in the previous equations.
We have also shown the equations for the downweighted
eigenvalues and eigenvectors and their corresponding
influence function. If $\lambda_j$ and $\alpha_j$ represent the weighted
eigenvalue and eigenvector, then the formulas for $\nu_{ij}$ and
$\rho_{ij}$ are the same as for deleted influence with $y_{ij} = (x_i - m)'\alpha_j$.
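
The weighting step and the effect of setting one weight to zero can be sketched as follows. The sketch assumes Huber-type weights that are rescaled to sum to one before the weighted mean and covariance are formed, and it uses the closed forms $m_{(i)} = (m - w_i x_i)/(1 - w_i)$ and $\Omega_{(i)} = (1 + e)\{\Omega - e\,(x_i - m)(x_i - m)'\}$ with $e = w_i/(1 - w_i)$. All function names are illustrative rather than taken from the paper.

```python
def huber_weights(scores, k):
    """Weight 1 up to the cutoff k, then k / d for larger scores d."""
    return [1.0 if d <= k else k / d for d in scores]


def normalize(weights):
    """Rescale weights so that they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]


def weighted_mean_cov(points, weights):
    """Weighted mean and covariance for weights that sum to one."""
    p = len(points[0])
    m = [sum(w * x[a] for w, x in zip(weights, points)) for a in range(p)]
    cov = [[sum(w * (x[a] - m[a]) * (x[b] - m[b]) for w, x in zip(weights, points))
            for b in range(p)] for a in range(p)]
    return m, cov


def downweighted(points, weights, i):
    """Mean and covariance after point i is given weight zero (closed form)."""
    m, cov = weighted_mean_cov(points, weights)
    p = len(m)
    w_i = weights[i]
    e = w_i / (1.0 - w_i)  # plays the role of the contamination fraction
    dev = [points[i][a] - m[a] for a in range(p)]
    m_i = [(m[a] - w_i * points[i][a]) / (1.0 - w_i) for a in range(p)]
    cov_i = [[(1.0 + e) * (cov[a][b] - e * dev[a] * dev[b]) for b in range(p)]
             for a in range(p)]
    return m_i, cov_i
```

The closed forms agree with recomputing the weighted mean and covariance from scratch on the remaining points with renormalized weights, so an iteration never needs to revisit the full data to evaluate the effect of one point.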

5.  RESULTS 

It has been shown (see Vetter, 1992) that the
ESI procedure consistently outperformed several of the
well known procedures based upon the Mahalanobis
distance. The bias and mean squared error in estimating
covariances and eigenvalues using the ESI procedure
were compared to two M-type estimators (see Maronna,
1976, and Campbell, 1980), both based upon ellipsoidal
distances. ESI performed as well as the two procedures
when contamination existed in the form of outliers in the
direction of the principal component axes, but provided
visibly better results when contamination was present
between the axes, where points have more influence on the
eigenvectors.

Another advantage of the ESI procedure is that
for problems of less than full rank, the construction of the
proposed procedure facilitates improved estimation by
exclusion or downweighting of the influence of points on
directions not included in the analysis. As an example, in
situations where the relationships of the minor principal
components are of interest, the influence of points on only
these last few components with the smallest variation
need be considered. For graphical procedures such as
the biplot, a robust/resistant biplot can be obtained by
using the ESI method with scaling factors $c_1 = c_2 = 1$
and $c_j = 0$ for $j > 2$. This bases the weights on the
influence of points on the first two principal components
upon which the biplot is based (see Vetter, 1992).

The  proposed  procedure  can  be  applied  to 
correlation  matrices  by  substituting  the  formulas  for 
influence  of  vector  observations  on  the  correlation  matrix, 
which  have  been  developed  by  Calder  and  described  in 
Jolliffe (1986). This would be particularly beneficial for
principal  components  based  upon  the  correlation  matrix, 
since  it  has  been  shown  that  points  which  are  influential 
for  the  covariance  matrix  need  not  be  particularly 
influential  for  the  correlation  matrix.  This  is  in  part  due  to 
the  fact  that  for  a  correlation  matrix  the  eigenvalues  sum 
to  the  number  of  variables  in  the  observation  vector.  An 
investigation  could  be  made  of  those  points  where  large 
influence  is  indicated  for  either  covariance  or  correlation 
but  not  the  other. 

Abstract

Prior  to  performing  an  analysis  on  data,  the  statistician 
must  address  the  problem  of  outliers.  One  way  to  do  this  is 
to  use  the  Minimum  Volume  Ellipsoid  (MVE)  estimator, 
which  has  desirable  robustness  properties  due  to  its 
breakdown  point  of  50%.  It  is  defined  as  the  ellipsoid  of 
minimum  volume  that  covers  half  of  the  data  points.  A 
major  problem  with  using  the  MVE  is  that  few 
computationally attractive methods exist for its calculation,
especially in high dimensions and for large sample sizes.
Determining the MVE consists of two parts. The first is to
find the correct subset of points used to calculate the MVE,
and the second is to find the ellipse that covers this set.
Finding the subset of points to be covered by the MVE will
be addressed in this paper. The solution proposed here is to
use the Effective Independence Distribution (EID) method
which chooses the subset by minimizing determinants of
matrices  based  on  the  data.  Results  show  that  the  volume 
of  the  ellipse  using  the  EID  subset  of  points  differs  from  the 
optimal  by  less  than  6%  for  some  regression  data  sets 
where  the  true  MVE  is  known. 

1.  Introduction 

The  existence  of  outliers  and  how  to  deal  with  them  is 
an  important  problem  in  statistics.  The  MVE  was  first 
proposed  as  a  robust  estimator  of  location  and  shape  by 
Rousseeuw  [1],  but  its  use  has  been  hampered  by  the  lack  of 
a  computationally  feasible  means  of  calculating  it.  The 
MVE  is  defined  as  the  ellipsoid  of  minimum  volume  that 
covers  approximately  half  of  the  points  in  a  data  set.  From 
this  one  can  see  that  it  is  a  configuration  of  high  content, 
but  minimum  volume. 

The  problem  of  finding  the  MVE  is  two-fold.  One  must 
first  find  the  subset  of  points  that  should  be  covered  by  the 
ellipsoid  and  then  weight  the  data  such  that  these  points  are 


covered by it. A solution for finding the weights is
described  in  Hawkins  [2].  Several  methods  have  been 
described  in  the  literature  to  find  the  subset  of  points.  The 
first  was  the  basic  resampling  method  suggested  by 
Rousseeuw  and  Leroy  [3].  Subsequent  methods  that  have 
been  developed  include  the  Feasible  Solution  Algorithm 
(FSA)  by  Hawkins  [2],  and  some  heuristic  search 
algorithms  are  described  in  Woodruff  and  Rocke  [4]. 
These  authors  compare  the  resampling  or  undirected 
random  search  method  to  simulated  annealing,  genetic 
algorithms, and tabu search. All of these methods are
approximate ones, so obtaining the exact MVE is not
guaranteed  for  a  finite  amount  of  resampling. 

This  paper  is  based  in  part  on  the  work  done  by  Hawkins 
[2].  We  propose  a  solution  to  the  subset  selection  problem 
called  the  EID  method.  Some  background  information  on 
the  MVE  estimator  is  provided,  and  the  EID  method  is 
described.  Results  are  presented  that  show  the  relative 
error  in  the  volume  of  the  ellipsoid  found  using  the  EID 
approach  for  several  regression  data  sets  where  the  true 
MVE  is  known. 

2.  Minimum  Volume  Ellipsoid  Estimator 

The problem of robust estimation of multivariate
location and shape is that given a set of $n$ observations $x_i$,
each one having $p$ dimensions, find an estimate of location
and shape that is resistant to outliers or contaminated data.
The MVE is one such estimator and it is given by the
ellipsoid [2]

$$(x - c)^T \Gamma^{-1} (x - c) = p \qquad (1)$$

where $c$ and $\Gamma$ are the location vector and scatter matrix
respectively and $p$ is the dimension of the data. The
location vector is a weighted mean calculated as

$$c = \sum_{i=1}^{h} w_i x_i^* \qquad (2)$$

and the covariance or scatter matrix is

$$\Gamma = \sum_{i=1}^{h} w_i (x_i^* - c)(x_i^* - c)^T \qquad (3)$$

where $x_i^*$ is a column vector denoting the ith observation in
the subset of $h$ points, $w_i$ is the weight for the ith
observation, and $h = [(n + p + 1)/2]$ (the brackets denote
the greatest integer function). The volume of the covering
ellipse will be proportional to the determinant of $\Gamma$. It is
evident from Eqs. 2-3 that to find the MVE one must
determine which $h$ points should be covered and the
corresponding weights to ensure coverage of the points.
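
The subset size $h = [(n + p + 1)/2]$ is simple integer arithmetic. The short sketch below evaluates it for the $(p, n)$ pairs listed in Table I and compares the result with the tabulated $h$; the dictionary name and layout are illustrative.

```python
def subset_size(n, p):
    """Greatest integer in (n + p + 1) / 2."""
    return (n + p + 1) // 2


# (p, n, h) as listed in Table I.
TABLE_I = {
    "Aircraft": (4, 23, 14),
    "Coleman": (5, 20, 13),
    "Delivery": (2, 25, 14),
    "Education": (3, 50, 27),
    "Gravity": (5, 20, 13),
    "Salinity": (3, 28, 16),
}

computed = {name: subset_size(n, p) for name, (p, n, _) in TABLE_I.items()}
```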

It  is  known  [1,2,4]  that  the  MVE  has  a  breakdown  point 
that  approaches  50%  as  the  number  of  points  in  the  data  set 
increases  which  is  the  best  one  can  have.  This  means  that 
approximately  half  of  the  data  can  be  arbitrarily 
contaminated  without  affecting  the  estimate. 

The algorithm used in this paper to find the weights is
one developed by Titterington [5] and is also used by
Hawkins [2]. All of the weights are initially set to
$w_i^{(0)} = 1/h$, $i = 1, \ldots, h$, which are just the usual weights
given to points when calculating the sample mean of a data
set of size $h$. Then at each iteration $k$ calculate the
weighted mean and covariance from Eqs. 2-3 and the
Mahalanobis distances for each observation given by

$$d_i^{(k)} = (x_i^* - c^{(k)})^T \left(\Gamma^{(k)}\right)^{-1} (x_i^* - c^{(k)}) \qquad (4)$$

If $d_i^{(k)} \le p$ for every $i$, then the current ellipsoid using
$c^{(k)}$ and $\Gamma^{(k)}$ is the MVE covering the $h$ observations. If
the Mahalanobis distance for any of the observations
exceeds $p$, then the weights must be adjusted. They are
updated using the following

$$w_i^{(k+1)} = \frac{w_i^{(k)} d_i^{(k)}}{p} \qquad (5)$$

and the calculations of Eqs. 2-4 are repeated until none of the
distances exceeds $p$. This procedure enlarges the
ellipsoid until all of the $h$ points are covered.
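
The weight iteration can be sketched for two-dimensional data as follows. The sketch assumes that Eq. 5 is the multiplicative update $w_i \leftarrow w_i d_i / p$, uses the closed-form inverse of a 2x2 matrix, and stops when every distance is within a small tolerance of $p$. The function names, the tolerance, and the iteration cap are illustrative assumptions; convergence within the cap is not guaranteed for arbitrary data.

```python
def weighted_ellipsoid_2d(points, w):
    """Weighted mean c, scatter matrix and distances d_i (Eqs. 2-4) for p = 2."""
    cx = sum(wi * x for wi, (x, _) in zip(w, points))
    cy = sum(wi * y for wi, (_, y) in zip(w, points))
    sxx = sum(wi * (x - cx) ** 2 for wi, (x, _) in zip(w, points))
    syy = sum(wi * (y - cy) ** 2 for wi, (_, y) in zip(w, points))
    sxy = sum(wi * (x - cx) * (y - cy) for wi, (x, y) in zip(w, points))
    det = sxx * syy - sxy * sxy
    d = [(syy * (x - cx) ** 2 - 2 * sxy * (x - cx) * (y - cy)
          + sxx * (y - cy) ** 2) / det for x, y in points]
    return (cx, cy), ((sxx, sxy), (sxy, syy)), d


def covering_weights_2d(points, tol=1e-9, max_iter=10000):
    """Iterate w_i <- w_i * d_i / p until all d_i <= p (up to tol)."""
    p = 2
    w = [1.0 / len(points)] * len(points)
    for _ in range(max_iter):
        c, gamma, d = weighted_ellipsoid_2d(points, w)
        if max(d) <= p * (1 + tol):
            return w, c, gamma, True
        w = [wi * di / p for wi, di in zip(w, d)]
    c, gamma, d = weighted_ellipsoid_2d(points, w)
    return w, c, gamma, max(d) <= p * (1 + tol)
```

For the four corners of a square together with its centre, the iteration moves all weight onto the corners and returns the circumscribed circle, which is the smallest ellipse covering those points. Because $\sum_i w_i d_i = p$ at every step, the updated weights continue to sum to one.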

The  algorithm  for  finding  the  weights  can  be  somewhat 
computationally  intensive  for  some  data  sets.  However,  it 
should  be  apparent  that  the  real  computational  burden 
arises  from  the  determination  of  which  points  should  be 


covered  by  the  ellipse.  The  EID  algorithm  is  presented  as  a 
means  of  addressing  this  problem. 

3.  Effective  Independence  Distribution 

3.1  Background 

Since the volume of the minimum covering ellipse is
proportional to the determinant of the scatter matrix $\Gamma$, one
could approach this problem as that of optimizing the
determinant. In this application, the objective would be to
minimize the determinant of $\Gamma$. This provides the
motivation for using the EID method, since it can be shown
that deleting points based on their EID value will optimize
the determinant of the Fisher Information Matrix (FIM)
defined below. Of course, the FIM is not exactly the same
as $\Gamma$ of Eq. 3; however, results indicate that it will be a
reasonable approximation.

The EID vector [6,7,8] for a data set of $n$ $p$-dimensional
observations is calculated using the following equation

$$\mathrm{EID} = \mathrm{diag}\left(X (X^T X)^{-1} X^T\right) \qquad (6)$$

where $X$ is an $n \times p$ matrix with $n \gg p$ and each row
contains one observation. The EID is just the diagonal
elements of the 'hat' matrix which is familiar from
regression theory. Note that there are $n$ elements in the
EID vector, one corresponding to each observation. Finally
notice that

$$\sum_{i=1}^{n} \mathrm{EID}_i = p \qquad (7)$$

and that

$$0 \le \mathrm{EID}_i \le 1 \qquad (8)$$

which can be shown from the fact that the matrix in Eq. 6
is idempotent. The matrix $X^T X$ is called the FIM.

It  has  been  shown  [7,8,9]  that  the  following  relationship 
between  the  determinants  of  the  FIM  holds  as  one 
observation  is  deleted  from  the  data  set 

$$|X^T X|_{-i} = (1 - \mathrm{EID}_i)\,|X^T X| \qquad (9)$$

where the determinant on the left-hand side is calculated
with the ith observation removed from the data set, the
determinant on the right-hand side contains all of the data,
and $\mathrm{EID}_i$ denotes the EID value for the ith observation.
From this one can see that there is a direct relationship
between the determinants as the points are removed. Thus,
if the situation calls for minimizing the determinant of the
FIM then it is obvious from Eq. 9 that the observation with
the largest EID value should be deleted.

Two things should be noted from Eqs. 8-9. If an
observation is deleted that has a value of zero, then nothing
is lost by removing that point. If an observation has an EID
value of one, then that point cannot be removed. If such a
point is removed, then the determinant of the FIM becomes
zero and the resulting matrix is singular. Thus, an
observation with a value of one must be retained to keep the
problem at full rank p.

Table I. Data Set Parameters

    Data Set     p     n     h
    Aircraft     4    23    14
    Coleman      5    20    13
    Delivery     2    25    14
    Education    3    50    27
    Gravity      5    20    13
    Salinity     3    28    16
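
The properties in Eqs. 7-9 can be checked numerically on a small example. The sketch below computes the EID values of Eq. 6 for an $n \times 2$ data matrix using the closed-form inverse of the 2x2 matrix $X^T X$; the function names and the restriction to $p = 2$ are illustrative simplifications.

```python
def gram_2d(rows):
    """Entries (a, b, d) of the symmetric 2x2 matrix X^T X = [[a, b], [b, d]]."""
    a = sum(x * x for x, _ in rows)
    b = sum(x * y for x, y in rows)
    d = sum(y * y for _, y in rows)
    return a, b, d


def det_gram_2d(rows):
    """Determinant of X^T X."""
    a, b, d = gram_2d(rows)
    return a * d - b * b


def eid_2d(rows):
    """Diagonal of X (X^T X)^{-1} X^T, one value per row of X."""
    a, b, d = gram_2d(rows)
    det = a * d - b * b
    return [(d * x * x - 2 * b * x * y + a * y * y) / det for x, y in rows]
```

Deleting row $i$ and recomputing `det_gram_2d` gives $(1 - \mathrm{EID}_i)$ times the full determinant, so the row with the largest EID value is the one whose removal shrinks the determinant the most.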

3.2 The EID method of subset selection

The EID values can be used to successively remove
points from the data set until h points remain. These h
points will then be used with the algorithm described in
Section 2 that will find the weights and the resulting
ellipsoid. However, to better approximate the matrix $\Gamma$, the
data will be centered by subtracting the p-dimensional
sample mean from each observation. This is repeated as
each point is deleted. The procedure consists of the
following steps:

1. Calculate the matrix

$$X_c^{(j)} = X^{(j)} - \bar X^{(j)}$$

where $X^{(j)}$ is the set of raw data points at the jth
iteration (at iteration j=0 there are n points in the
set, at iteration j=1 there are n-1 points, etc.) and
$\bar X^{(j)}$ is an (n-j) x p matrix with each row containing
the p-dimensional sample mean for the current set of
data.

2. Use the matrix $X_c^{(j)}$ in Eq. 6 to calculate the
EID value for each point in the current data set.

3. Delete the point that corresponds to the
maximum EID value.

4. Repeat steps 1-3 until only h points remain.

5. Adjust the weights until the h points are covered.

The EID tends to give points a large EID value if they
have large magnitudes. However, this is not always the
case; e.g., if an observation must be retained to keep the
problem non-singular then it will have an EID value of one
regardless of the magnitude. For a detailed discussion of
this point and some examples see Kammer [6] and Poston,
Priebe and Holland [8]. For this reason and because the
desired output is a robust estimation of location, the
centering of the data at each iteration is needed, which is
the reason for the first step.

Figure 1. Percent relative error in the volume of
the MVE as determined by the EID approach.

4. Applications and Results

To test the usefulness of this method, it is applied to
several data sets where the true MVE is known. The paper
by Hawkins [2] gives the correct subset and the resulting
volume of the true MVE for these data sets. The relative
error in the volume of the ellipse based on the subset
obtained using the EID method can then be determined for
comparison purposes. There are 6 data sets which are
taken from Rousseeuw and Leroy [3]. These data are used
for regression purposes, and only the predictors are used
here to determine the MVE. The parameters of interest are
shown in Table I. From this one can see that the data sizes
are relatively small ranging in size from
n=20 to n=50. The dimensionality of the data is also low,
from 2 to 5 dimensions.

Table II. Timing Results for Methods (sec)

    Data Set     EID     Splus, Genetic Algorithm
    Aircraft     0.22    68.0
    Coleman      0.17    67.0
    Delivery     0.11    28.0
    Education    0.77    74.0
    Gravity      0.11    62.0
    Salinity     0.22    50.0

For this study, the EID algorithm is implemented in
MATLAB on a 486, 33 MHz computer. The relative error
in the volumes of the minimum covering ellipsoid using the
EID approach is shown in Figure 1. It is evident from the
small error that ours is a feasible approach to finding the
MVE.
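
The subset-selection steps of Section 3.2 (center the data, compute the EID values, delete the point with the largest value, repeat until h points remain) can be sketched as follows. This is an illustrative Python version for two-dimensional data, not the MATLAB code used in the study; the function names and the toy data set are assumptions, the 2x2 inverse is written in closed form, and degenerate configurations with a singular centered matrix are not handled. The weight adjustment of step 5 is left to the algorithm of Section 2.

```python
def centered_eid_2d(points):
    """EID values (Eq. 6) of 2-D points after subtracting the sample mean."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    rows = [(x - mx, y - my) for x, y in points]
    a = sum(x * x for x, _ in rows)
    b = sum(x * y for x, y in rows)
    d = sum(y * y for _, y in rows)
    det = a * d - b * b
    return [(d * x * x - 2 * b * x * y + a * y * y) / det for x, y in rows]


def eid_subset_2d(points):
    """Indices of the h = [(n + p + 1) / 2] points kept by greedy EID deletion."""
    n, p = len(points), 2
    h = (n + p + 1) // 2
    keep = list(range(n))
    while len(keep) > h:
        eid = centered_eid_2d([points[i] for i in keep])  # steps 1-2
        worst = max(range(len(keep)), key=eid.__getitem__)  # step 3
        del keep[worst]
    return keep


# Hypothetical data: a tight cluster of five points plus two distant points.
TOY_DATA = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5),
            (10.0, 10.0), (12.0, 9.0)]
kept = eid_subset_2d(TOY_DATA)
```

On this toy set the two distant points are deleted and the five clustered points are kept. Each pass costs only one small matrix inversion, which is consistent with the short run times reported for the EID approach.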

The  time  needed  to  determine  the  subset  of  points  is 
given  in  Table  II.  Also  in  this  table  are  some  timings 
obtained  using  Splus  3.1  to  determine  the  MVE  estimate  of 
a  covariance  matrix.  This  software  uses  a  genetic 
algorithm  to  find  the  subset  of  points.  The  purpose  here  is 
to  provide  a  very  rough  comparison  of  the  two  methods  in 
terms  of  the  computational  effort  involved.  From  these 
results,  one  can  see  that  using  the  EID  provides  a  savings 
in  time  when  calculating  the  MVE.  This  would  become 
more  important  as  the  dimensionality  and  size  of  the  data 
set  increases. 

The  2-dimensional  ‘delivery’  data  set  is  shown  in 
Figure  2  to  provide  a  qualitative  assessment  of  the  method. 
From  this,  one  can  see  that  the  bulk  of  the  data  is  clustered 
toward  the  origin.  We  would  suspect  then  that  the  MVE 
would be in this area also. Figure 3 shows the data set that
corresponds to the true MVE as given in Hawkins [2]. As
expected, the MVE is near the origin. When the EID
method is applied to this data set, the first observations that
are deleted are the outlying ones in the upper right-hand
corner of the plot. It is not until the last points are deleted
that  the  EID  algorithm  makes  an  incorrect  choice.  The  set 
chosen  by  the  EID  approach  is  shown  in  Figure  4.  Note  the 
point  that  is  incorrectly  retained  in  the  set.  One  reason  for 
this  error  is  that  the  point  the  EID  deletes  has  a  larger 
magnitude  than  the  one  that  should  be  kept  in  the  set.  As 
stated  before,  these  will  be  the  points  that  tend  to  have  a 
larger  EID  value  in  some  cases. 

Finally,  one  last  comparison  is  in  order  regarding  the 
‘salinity’  data  set.  It  is  stated  in  Hawkins  [2]  that  this  set 
would  require  approximately  5,000  random  starts  with  the 
FSA  to  reliably  determine  the  MVE  which  is  a 


computationally  intensive  task.  Note  that  for  this  data  set 
the EID method of subset selection finds a set of points in
0.22  sec  with  only  3%  error  in  the  volume  of  the  ellipse. 

5.  Summary 

In  this  paper,  the  EID  method  of  determining  the  subset 
of  points  used  in  the  MVE  has  been  described.  Subset 
selection  is  what  makes  the  MVE  a  computationally 
expensive  algorithm  to  implement  in  daily  practice. 
Preliminary  results  indicate  that  the  EID  method  for 
selecting  the  set  of  points  to  be  included  in  the  MVE 
estimator  is  a  useful  one.  The  time  required  for  subset 
selection  is  less  than  a  second  for  the  data  sets  considered 
here,  and  it  is  expected  that  for  large  n  similar  savings  in 
time  can  be  achieved. 

The 2-dimensional scatterplots of the ‘delivery’ data
indicate qualitatively that the EID tends to pick a tighter
cluster of points, whereas the set of points making up the
true MVE is somewhat narrower. This example helps
illustrate  an  important  point  about  the  MVE.  Since  it  is  an 
ellipsoid  of  minimum  volume  it  does  not  necessarily  pick 
the  tightest  cluster  of  data.  It  is  suspected  that  the  EID 
approach  might  yield  better  results  based  on  some  other 
criterion;  e.g.,  better  covariance  structure  or  clustering. 
These  ideas  will  be  examined  in  more  detail  as  part  of  the 
future  work  in  this  area. 

Although  the  EID  method  is  not  guaranteed  to  find  the 
true  MVE,  it  has  certain  advantages  that  make  it  more 
attractive  than  the  algorithms  currently  in  use.  As 
discussed  previously,  it  involves  little  computational  effort, 
and thus it is suitable for sets with large n and p. Also, due
to the iterative nature of the method, it would be easy to get
a family of estimators for different values of h, which is a
useful  feature  [2].  This  makes  the  use  of  the  EID  method 
feasible,  thus  allowing  the  statistician  to  easily  employ  this 
robust  method  of  estimating  multivariate  location  and 
scatter. 

Abstract: We discuss computational approaches to
multivariate function estimation using strengths from the
spline and local fitting worlds. In particular, we consider
piecewise polynomial representations of the surface
that can be easily evaluated at any point in the domain,
and estimating parameters using local approaches. Ideas
along these lines using rectangular partitions of the domain
have been used previously by Silverman (1981) and
Cleveland and Grosse (1991). In this paper we propose
triangular partitions which may offer greater efficiency
and adaptability in modeling surfaces.

1  Introduction 

Figure  1  is  a  contour  plot  of  a  density  estimate.  The 
sample  here  consists  of  225  observations  from  an  equal 
mixture  of  three  standard  bivariate  normal  distributions. 

This figure was constructed by direct application of
the local likelihood density estimate, discussed in Loader
(1993). In particular, for each point x on a 100 x 100 grid,
the following system of equations was solved for a:

$$\sum_{i=1}^{n} A(X_i - x)\, K\!\left(\frac{X_i - x}{h}\right) = n \int A(u - x)\, K\!\left(\frac{u - x}{h}\right) \exp(\langle a, A(u - x) \rangle)\, du \qquad (1)$$

Figure 1: Density Estimate from 225 points. Direct application
of the local likelihood method. Local cubic fitting
was used with a variable bandwidth covering 113 data
points.

Here, $A(v)$ is a vector of the ten cubic basis polynomials
and $a$ is a vector of unknown local coefficients. The
density estimate is $\hat f(x) = \exp(\langle a, A(0) \rangle)$. As an estimate,
the local likelihood method performs quite well; the three
peaks are clearly separated and are reproduced almost to
full height; the true peak height is about 0.053.

However, as a computational method, direct application
of the local likelihood method is wasteful and inefficient.
The local likelihood method assumes the underlying
density is smooth, and therefore it seems unnecessary
to work independently at very close points.

Moreover,  we  only  have  the  estimate  evaluated  on  a 
discrete  grid  of  points;  for  many  purposes,  this  may  be 
unsatisfactory.  To  use  the  fitted  surface  for  classification 
or  prediction  requires  us  to  be  able  to  readily  evaluate 
the  surface  at  any  point;  not  just  the  grid  points.  In 
the  regression  setting,  a  residual  plot  requires  evaluation 
of  the  surface  at  the  data  points.  Particularly  in  higher 
dimensions, one must visualize the surface by looking at
lower dimensional sections. Or, one may be interested in
looking more closely at part of the surface.

The subject of this paper will be better ways to compute
local fits. In particular, we require efficient ways to
compute and represent the fitted surface. The examples
presented are all density estimation, although very little
is specific to this case.

2  What Computation is, and is not

The  computation  of  local  fits  is  often  portrayed  in  the 
statistics  literature  as  a  race  to  evaluate  the  surface  as 
fast  as  possible  at  a  predetermined  set  of  points.  This 
represents  a  very  narrow  focus;  while  computation  speed 
is  certainly  important,  there  are  many  other  important 
considerations. 

Fundamentally, the fit must summarize the data; not
vice versa. The surface should be represented in a compact
manner to allow prediction, classification, resampling,
interactive visualization etc. Just computing the
surface on a grid of points is in general not satisfactory.

An algorithm for local fitting generally consists of a
number of building blocks. On top of basic fast algorithms,
one may want additional features, such as choices in the
order of fitting; robustness; variable bandwidths and
application to a wide variety of settings such as regression,
density estimation, local likelihood fitting and multivariate
problems. Ideally, the basic computational algorithms
should be adaptable to as many of these settings as
possible. Methods that require a fixed bandwidth, or only
work in one dimension, are of very limited utility.

For the statistician, it is important that procedures
be implemented within the environments with which they
are familiar, and that provide facilities for data management
and interactive analysis and graphics and the like. A
stand-alone C or Fortran program will certainly be much
faster to execute than software incorporated in a statistical
computing environment; however it is much less useful
to the statistician because of the overhead required to
manage data, interface with other systems for display etc.

We can now state why some of the ‘fast’ methods discussed
in the statistics literature are essentially useless as
general purpose computational algorithms. As particular
examples, we focus on binning and updating methods,
presented for example by Fan and Marron (1994). First
we note the comparisons in that paper are essentially
meaningless in light of the preceding discussion: Figure
8 of Fan and Marron (1994) compares times for local linear
smoothing; one curve is for a weighted smoother with
robustness iterations running in the S environment, while
the methods promoted by Fan and Marron use a uniform
kernel smoother without robustness iterations and compiled
as a stand-alone C program. This comparison says
nothing about the basic computational algorithms.

The binning method divides the predictor space into
intervals or a grid, and assigns observations to grid points.
For large samples, this effectively reduces the sample
size, thereby speeding up the computations. But the binning
step is essentially a preliminary local constant fit; to
avoid bias problems, large numbers of bins are required.
This is particularly so in the multivariate case. Fan and
Marron report that the major speedups result from reducing
the observations to an equally spaced grid, reducing the
number of kernel evaluations. However,
this only works with a constant bandwidth!
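
The binning step described above can be sketched as follows. This is a minimal one-dimensional illustration, not the Fan and Marron implementation: the function names, the equal-width bins and the Epanechnikov kernel are all assumptions made for the example. It shows why binning acts as a preliminary local constant fit: every observation is replaced by the center of its bin before the kernel is applied, and the scheme relies on a single fixed bandwidth h.

```python
def bin_counts(xs, lo, hi, n_bins):
    """Count observations in equal-width bins on [lo, hi].

    Returns (centers, counts). Each observation is effectively moved
    to its bin center, which is a local constant approximation.
    """
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in xs:
        k = min(int((x - lo) / width), n_bins - 1)  # x == hi goes in last bin
        counts[k] += 1
    centers = [lo + (k + 0.5) * width for k in range(n_bins)]
    return centers, counts


def binned_kernel_density(x0, centers, counts, h):
    """Fixed-bandwidth kernel density estimate at x0 using bin counts.

    Assumed kernel (illustration only): Epanechnikov, K(u) = 0.75 (1 - u^2).
    """
    n = sum(counts)
    total = 0.0
    for c, m in zip(centers, counts):
        u = (x0 - c) / h
        if abs(u) < 1.0:
            total += m * 0.75 * (1.0 - u * u)
    return total / (n * h)
```

With few bins the estimate inherits the bias of the local constant step, which is the reason the text notes that large numbers of bins are needed.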

Updating methods are designed for polynomial weight
functions, and expand sums such as those on the left hand
side of (1) in powers of the $X_i$. The sums are then updated
by moving to nearby x values. However, currently
available implementations only address one dimensional
problems, and the fastest implementations have serious
stability problems.

Most importantly, both of these methods require solution
of the optimization problem (1) for each fitting
point. Hence, particularly for our density estimation setting,
these methods are unlikely to be particularly fast.
Also, the methods are geared towards evaluating the estimate
on a predetermined set of points (for binning, on an
equally spaced grid), which is not satisfactory for many
statistical purposes such as prediction, classification,
resampling and interactive graphics. Neither method provides
a compact representation of the fitted surface.

3  Piecewise  Polynomials 

Suppose we have a sequence of vertices $v_0 < v_1 < \ldots < v_k$,
and evaluations of a function $f(v_i)$ and its derivative
$f'(v_i)$ at each vertex. There exists a unique
function that matches the function values and derivatives
at each vertex, and is piecewise cubic on each interval
$[v_{i-1}, v_i]$. This type of construction enables us to
approximate smooth functions quite cheaply using Hermite
polynomials. See De Boor (1978, Chapter 4) for further
discussion of this scheme.
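
As a small illustration of the piecewise cubic Hermite construction, the sketch below evaluates the interpolant from vertex values and vertex derivatives using the standard cubic Hermite basis functions. The function name and argument layout are assumptions for the example and are not taken from the paper.

```python
from bisect import bisect_right


def hermite_eval(x, v, f, df):
    """Evaluate the piecewise cubic Hermite interpolant at x.

    v  : sorted list of vertices
    f  : function values at the vertices
    df : derivative values at the vertices
    Points outside [v[0], v[-1]] are extrapolated from the end interval.
    """
    # Locate the interval [v[i], v[i+1]] containing x.
    i = min(max(bisect_right(v, x) - 1, 0), len(v) - 2)
    h = v[i + 1] - v[i]
    t = (x - v[i]) / h
    # Standard cubic Hermite basis on the unit interval.
    h00 = 2 * t**3 - 3 * t**2 + 1
    h10 = t**3 - 2 * t**2 + t
    h01 = -2 * t**3 + 3 * t**2
    h11 = t**3 - t**2
    return h00 * f[i] + h10 * h * df[i] + h01 * f[i + 1] + h11 * h * df[i + 1]
```

Because each piece is a cubic determined by two values and two slopes, the interpolant reproduces any cubic polynomial exactly; for example, with vertices 0, 1, 3 and the values and derivatives of x^3, evaluation at x = 2 returns 8.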

Another closely related method is the cubic spline
approximation. In the most common use, the cubic spline
method enforces continuity of the first and second
derivatives; it is not required to match the true derivatives.

Several methods along these lines have been used in
nonparametric function estimation. Penalized likelihood
methods (Wahba, 1990) give rise to a cubic spline estimate
with vertices at the data points. Regression splines,
and the logspline density methods of Kooperberg and
Stone (1991) estimate the parameters of a spline using
criteria such as maximum likelihood; this enables the
number of vertices to be substantially reduced. An
alternative method is to estimate the parameters using local
regression or likelihood methods. For this approach to be
computationally efficient, it is important to make effective
use of the vertices; for this reason the Hermite polynomial
rather than cubic spline approach is preferable. This is
the idea underlying the LOESS method (Cleveland and
Grosse, 1991).

The computational advantage of local regression and
likelihood appears in multidimensional cases. Global
approaches to fitting generally involve the solution of a large
scale optimization problem, which is generally very
expensive. While sparse matrix techniques have been
successfully applied to spline problems in one dimension,
their use in multiple dimensions is much more difficult.
By comparison, local regression methods solve a small
optimization problem at each vertex.

Two fundamental problems remain: The construction
of a suitable partition of a multidimensional domain, and
the construction of interpolants over this partition. The
most common types of partition involve rectangular cells
or triangular cells.

A rectangular partition is often preferred in theoretical
work, since it is often easier (or less difficult) to analyze,
and define suitable polynomial approximations. The
earliest work using local fits in this direction appears to
be Silverman (1981) who constructed an estimate based
on evaluation of a fixed bandwidth kernel estimate and
its derivatives over a coarse grid, and used a piecewise
quadratic scheme to interpolate over the cells of the grid.

A  further  development  in  this  direction  is  the  k-d  tree 
structure  introduced  by  Friedman,  Bentley  and  Finkel 
(1977)  and  applied  to  local  regression  by  Cleveland  and 
Grosse  (1991).  In  this  algorithm,  the  data  is  initially 
bounded  by  a  box,  and  the  cells  are  recursively  split.  The 
splits  in  the  k-d  tree  algorithm  always  divide  observations 
in  the  parent  cell  into  two  subsets  of  approximately  equal 
size,  and  hence  the  resultant  structure  has  most  vertices 
in  regions  of  high  point  density.  This  is  commensurate 
with the nearest neighbor bandwidths used in LOESS;
however,  other  split  rules  could  be  used  to  adapt  to  other 
situations. 


Figure 2: Division of 225 data points by a k-d tree with
34 vertices and 16 cells. Vertices are numbered in the
order they are entered into the tree.

An example of a k-d tree is shown in Figure 2. As
required, this partition has most vertices in regions of
high density. One can also identify some weaknesses;
in particular, occasionally vertices occur at very close
points, suggesting inefficiency. Also, construction of
interpolants over a rectangle depends on more than just
the corners; for example, interpolating over the rectangle
(10, 11, 18, 19) cannot ignore the evaluation at vertex 20.
This problem becomes particularly complex in three or
more dimensions.

An alternative to rectangular cells is partitions based
on triangular cells. A triangle in d dimensions has d + 1
vertices, against $2^d$ for cubes. This suggests there are
potential savings in multiple dimensions. The word
potential is stressed here, since there are both ‘good’ and
‘bad’ triangles.

Some examples of good and bad triangles are shown in
Figure 3. The ideal triangulation would consist of equilateral
triangles. In two dimensions, these can be tessellated;
unfortunately this is non-adaptive and therefore may be
wasteful in regions of low density where a large bandwidth
must be used. The right angled triangle scores ‘ok’; this
triangle can be interpolated over with reasonable success,
but we are unlikely to gain much over the use of rectangular
grids. The tall isosceles triangle is poor, since
some points may be far from the nearest vertex of the
triangulation. Finally, the flat isosceles triangle scores bad;
interpolating in the middle of the vertical edge will ignore
the two side vertices.

Figure 3: Examples of good and bad triangles in a
triangulation.

In addition to requiring good triangles, there are several
other competing criteria that must be considered
when growing a triangulation. We wish to maintain
adaptiveness; in particular, there should be more vertices in
regions where small bandwidths are used. The triangles
must be small enough for interpolants to work reasonably;
however, the vertices should be sparse enough for
efficiency. Finally, the triangulation must be fast to grow
and search.

4  Recursive Partitioning

In  this  section,  one  possible  triangulation  scheme  based 
on  recursive  partitioning  is  presented.  First,  draw  a  box 
around  the  data,  and  divide  the  box  into  two  triangles 
by  the  addition  of  a  diagonal.  The  triangulation  is  then 
grown  recursively  by  adding  vertices  to  existing  edges. 
Suppose existing vertices are $v_1, \ldots, v_m$, and $h_1, \ldots, h_m$
are the bandwidths used at these vertices. For each edge
$(v_i, v_j)$ in the existing triangulation, assign a score $\rho_{ij}$
[the defining formula is illegible in the source].

If $\rho_{ij} > c$, split the edge. Let $\lambda$ = [formula illegible in
the source] and create a new vertex at
$v_{m+1} = \lambda v_j + (1 - \lambda) v_i$. We then create
the new edges joining points to $v_{m+1}$. This process is
repeated until no more edges require splitting.

The use of bandwidths in determining the splits allows
the algorithm to preserve adaptivity. If $h_i$ is smaller than
$h_j$, then $\lambda$ will be less than 0.5 and $v_{m+1}$ is closer to $v_i$.
In extreme cases this can over-adapt, and so we restrict
$\lambda$ to the interval (0.2, 0.8).
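
The edge-splitting step can be sketched as below. The exact score and the exact formula for the split fraction are not legible in the source, so this sketch substitutes two loudly labeled assumptions: the score is taken to be the edge length divided by the smaller of the two bandwidths, and the split fraction is taken to be h_i / (h_i + h_j). Both choices are hypothetical; they are merely consistent with the stated behavior that a smaller h_i places the new vertex closer to v_i, and with the restriction of the fraction to (0.2, 0.8).

```python
import math


def split_edge(vi, vj, hi, hj, c=1.0, lam_min=0.2, lam_max=0.8):
    """Return the new vertex on edge (vi, vj), or None if no split is needed.

    ASSUMPTIONS (not from the paper):
      score  = |vi - vj| / min(hi, hj)
      lambda = hi / (hi + hj)
    The restriction of lambda to [lam_min, lam_max] follows the text.
    """
    score = math.dist(vi, vj) / min(hi, hj)
    if score <= c:
        return None
    lam = hi / (hi + hj)
    lam = min(max(lam, lam_min), lam_max)  # guard against over-adaptation
    # New vertex: lam * vj + (1 - lam) * vi, so small lam stays near vi.
    return tuple(lam * b + (1.0 - lam) * a for a, b in zip(vi, vj))
```

For instance, with v_i = (0, 0), v_j = (10, 0), h_i = 1 and h_j = 3, the assumed fraction is 0.25 and the new vertex lands at (2.5, 0), nearer the vertex with the smaller bandwidth.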


Figure  4:  Recursive  triangulation.  Vertices  are  numbered 
in  the  order  they  were  entered  into  the  triangulation. 

Figure 4 shows a recursive triangulation grown on the
trimodal data in Figure 1. The vertices are numbered in
the order they were entered into the triangulation. The
bandwidth used is variable, covering 113 nearest neighbors
for each fitting point. Some adaptivity can be seen in
this picture: There is a greater density of vertices around
the three peaks, where a smaller bandwidth is used for
the fit.


The recursive partitioning presented here has both
good and bad features. A big advantage is the triangulation
can be stored in a tree-like structure; for any point x,
it can then be rapidly determined which triangle contains
x. The disadvantage is a recursive scheme scores only an
‘OK’ in terms of goodness of triangles, and so one can’t
expect substantial gains in efficiency. There are also some
bad splits in the algorithm; for example, point 7 was used
to split the triangle (0, 1, 4); it would have been better for
the algorithm to split the (1, 4) edge first.
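
The tree search mentioned above can be illustrated with a short sketch. The node layout (a dictionary holding a triangle and a list of child nodes) is a hypothetical choice for this example; the paper does not describe its data structure. Containment is tested with barycentric coordinates, and the search descends from the root to a leaf.

```python
def barycentric(p, a, b, c):
    """Barycentric coordinates of point p with respect to triangle (a, b, c)."""
    det = (b[1] - c[1]) * (a[0] - c[0]) + (c[0] - b[0]) * (a[1] - c[1])
    l1 = ((b[1] - c[1]) * (p[0] - c[0]) + (c[0] - b[0]) * (p[1] - c[1])) / det
    l2 = ((c[1] - a[1]) * (p[0] - c[0]) + (a[0] - c[0]) * (p[1] - c[1])) / det
    return l1, l2, 1.0 - l1 - l2


def contains(p, tri, eps=1e-12):
    """True if p lies inside or on the boundary of the triangle."""
    return all(l >= -eps for l in barycentric(p, *tri))


def locate(p, node):
    """Descend the tree to the leaf triangle containing p (None if outside).

    node is a dict {"tri": (a, b, c), "children": [node, ...]}; this layout
    is an assumption made for the illustration.
    """
    if not contains(p, node["tri"]):
        return None
    while node["children"]:
        child = next((ch for ch in node["children"] if contains(p, ch["tri"])), None)
        if child is None:  # numerical edge case: fall back to the parent
            break
        node = child
    return node["tri"]
```

Each level of the descent tests only the children of one triangle, so the cost of locating a point grows with the depth of the tree rather than with the total number of triangles.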

5  Finite  Element  Interpolants 

The finite element method constructs interpolants over
the cells of a partition. Only vertices on the boundary of
a cell are used. This has computational advantages; the
interpolants depend on only a small number of parameters
and hence do not require solving large systems of
equations.

The cell-based construction of finite elements leads to
substantial difficulty in enforcing global smoothness
conditions. For visualization purposes, we would like our
surface to be continuous and differentiable. Also, the
interpolant should be commensurate with our fitting
procedure; for example, an interpolant that only reproduces
linear polynomials is not adequate for use with a local
quadratic or cubic fitting procedure.

A two dimensional element suitable for our purposes
is the Clough-Tocher finite element (Clough and Tocher,
1965; Lancaster and Salkauskas, 1986). This method uses
twelve pieces of information: The function values and
derivatives at each vertex of the triangle, and the normal
derivatives at the midpoint of each side. The Clough-Tocher
finite element is then piecewise cubic over each of
three sub-triangles, with continuity and differentiability
conditions enforced at the interior seams. See Figure 5.
A remarkable feature of the Clough-Tocher method is it
produces a globally $C^1$ surface; when the method is
applied independently on adjacent triangles, the resulting
surface is differentiable at the common boundary!

The full twelve parameter Clough-Tocher method will
reproduce a cubic polynomial. Unfortunately our local
fitting procedure will not produce the normal derivatives
at the midpoints of the sides; in practice, these are
estimated by linear interpolation. This reduced nine
parameter Clough-Tocher method reproduces all quadratic
terms. A cubic reproducing scheme could be constructed
by using second derivatives at the vertices and estimating
normal derivatives using quadratic interpolation.

Figure 6 shows the Clough-Tocher method applied to
the data from Figure 1 using the triangulation in Figure
4. Qualitatively the picture looks very similar to the direct
fit; the three peaks are kept separate and reproduced
to a similar height in both figures. Some differences are
visible, particularly in the 0.01 contours.

Figure 5: Clough-Tocher method. The interpolant uses
function values at the three vertices of the triangle, and
the nine directional derivatives indicated. The interpolant
is made up of three piecewise cubics over the interior
triangles.

Figure 6: Density estimate using Clough-Tocher interpolation
over a triangulation. Local cubic fitting.

Alternative constructions of interpolants can be based
on piecewise quadratic, rather than cubic, polynomials.
This has advantages noted by Silverman (1981) for producing
contour plots since level sets can be readily found.
Another advantage is in locating local maxima of the estimate.
The difficulty is that a quadratic has fewer degrees
of freedom, and additional internal seams must be introduced
to enforce boundary constraints. For example,
Silverman divides each cell into sixteen triangles. For a
construction of a quadratic surface on a triangular
partition, see section 6.2 of Chui (1988).

                 No. Vertices   Distance to       Distance to
                                direct estimate   true density
Direct               225             -               0.261
Triangulation         55           0.0296            0.269
k-d tree              34           0.0638            0.282
k-d tree              66           0.0345            0.264

Table 1: Comparison of direct fitting with triangulation
and k-d tree based interpolation schemes.

6  Concluding  Remarks 

The triangulation approach is competing with the k-d tree
as an approximation method. A natural question is to try
to make an objective comparison of their performance.
We consider ‘comparable’ estimates to consist of the same
or similar numbers of vertices. The real question of course
is how well do the interpolated estimates approximate
the true density. In practice we cannot measure this,
so we also consider how well the interpolated estimate
approximates the direct. We consider the criterion

$$d(\hat f, \tilde f) = \frac{1}{n} \sum_{i=1}^{n} \left| \log \hat f(X_i) - \log \tilde f(X_i) \right|$$

where $\hat f(x)$ is the direct density estimate and $\tilde f(x)$ is the
interpolated (triangulation or k-d tree) density estimate.
This is also an approximation to the $L_1$ distance
$\int |\hat f(x) - \tilde f(x)|\,dx$ between the two estimates.
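
A minimal sketch of how such a comparison criterion could be computed is given below. It reads the criterion as the mean absolute difference of the log densities over the observations, which is an interpretation of the partly illegible formula in the source; the function and argument names are hypothetical.

```python
import math


def mean_abs_log_diff(f_direct, f_interp, points):
    """Average of |log f_direct(x) - log f_interp(x)| over the given points.

    f_direct, f_interp : callables returning positive density values
    points             : the observations X_1, ..., X_n
    """
    total = 0.0
    for x in points:
        total += abs(math.log(f_direct(x)) - math.log(f_interp(x)))
    return total / len(points)
```

When the two estimates are close, the log difference is approximately the relative difference, and averaging over a sample drawn from the density turns the sum into an approximate integral of the absolute difference, which is the link to the L1 distance noted in the text.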

Table  1  shows  the  results  for  our  triangulation  and 
two  different  sizes  of  k-d  tree.  As  we  would  hope,  the  34 
vertex  k-d  tree  is  substantially  beaten.  The  triangulation 
also  slightly  beats  the  66  vertex  k-d  tree  compared  to  the 
direct  estimate,  but  loses  slightly  compared  to  the  true 
density.  Of  course,  not  too  much  should  be  concluded 
from  one  example  selected  by  the  author;  however  we 
believe  this  certainly  gives  grounds  for  optimism. 

The triangulation used in section 4 is based on recursive
partitioning. This enables the triangulation to be
stored as a tree type structure, which facilitates rapid
searching; in particular, one can rapidly determine which
triangle contains an arbitrary point in the domain. There
are however some disadvantages; in particular, we have
noted the resulting triangulation will generally only score
‘OK’ on the scale of Figure 3. We have also given up another
major advantage of triangulations; namely the ability
to well approximate fairly arbitrary nonlinear boundaries
and domains.

There are alternative methods of growing triangulations,
any of which may have advantages and disadvantages
in our application. An obvious alternative is a sequential
scheme, beginning with a seed triangle in the
center of the data, and adding neighboring triangles until
the domain is filled. This loses the tree structure of our
recursive scheme. Some advantages are that ‘good’ triangulations
may result more readily, and easier adaptability
to non-rectangular boundaries.

Another  step  that  might  be  considered  is  to  attempt 
to  optimize  the  triangulation  in  some  sense.  While  a  full 
blown  optimization  is  the  type  of  expensive  problem  we 
seek  to  avoid,  some  improvement  may  be  obtained  by 
moving  vertices  around  or  constructing  triangulations  on 
a  given  set  of  vertices.  Some  discussion  of  such  schemes 
can  be  found  in  Barnhill  (1977). 

There  are  a  number  of  directions  in  which  this  work 
can  be  extended.  Perhaps  the  most  obvious  is  beyond  two 
dimensions,  where  as  already  noted  there  is  potential  for 
triangulations to be much more efficient than rectangular
grids. 

An obvious extension of this work (indeed the original
motivation) is to higher dimensions, where the potential
saving of triangular cells over rectangular by reducing the
number of vertices seems much greater. Of course, any
nonparametric fitting becomes much more difficult beyond
two dimensions due to data sparseness. However,
the construction of good triangulations is more difficult;
also difficult is the problem of enforcing global differentiability
in finite element interpolants.

There are also some theoretical questions that could
be pursued. One obvious question is to analyze how well
interpolated methods perform as estimates; in particular,
we hope to preserve good properties of local polynomial
fitting. Such an analysis may also help suggest
refinements to the triangulation; in particular, how deep
should the triangulation be taken? Another question is
how to estimate derivatives. The present implementation
uses the local slopes from the local polynomial fits; this
is convenient since the slopes are available at no extra
computation costs. Theoretical analysis suggests these
slopes have good properties as derivative estimates; however,
not always at the same bandwidth as for estimating
the function itself. The author has found this problem
severe in some cases for local quadratic fitting; hence the
preference for local cubic fitting in the examples in this
paper.

Finally, we mention some ongoing work in nonparametric
function estimation related to the topics discussed
in this paper. Eric Grosse is considering alternative
constructions of Loess estimates based on the k-d tree
partition; in particular, placing vertices at the centers of the
cells, rather than the corners. This solves the problem
of nearby vertices mentioned in relation to Figure 2, and
should improve efficiency of the scheme. Mark Hansen
and Charles Kooperberg are using triangulation methods
in a global likelihood estimation scheme, considering
generalized vertex splines (Chui, 1988) and other methods to
construct estimates.

Abstract

Electronic bulletin boards, client/server computing,
and inter-network connectivity have promoted the transfer
and exchange of data. Differences in computer hardware
prevent seamless transfer of data from one computer
system to another. Data transfer implies data conversion
except when compatible hardware is employed.
Even when compatible hardware is used, careless specification
of data formats may lead to serious problems.
Situations arise where improper conversion may go undetected,
leading to the statistical analysis of invalid data.
This article addresses data conversion issues and highlights
pitfalls that impact the transfer of numerical data
between different computer systems.

1.  INTRODUCTION 

An  insurance  company  must  deliver  an  extract  of 
claims  data  stored  on  an  IBM  mainframe  computer  to 
a  small  information  services  company  which  operates  a 
network  of  UNIX  workstations  and  personal  computers. 
The  insurance  company  provides  the  data  on  3480  tapes 
using  a  variable  length  record  layout.  A  block  size  B 
is specified, and records have no record delimiters. Instead,
individual records are determined by processing
the  first  four  bytes  of  a  record  as  the  record  length  in 
bytes.  Blocking  is  determined  by  a  four  byte  field  that 
begins  each  block  and  gives  the  block  size  in  bytes.  A 
block  may  contain  more  than  one  record,  but  no  partial 
records.  Both  block  size  and  record  length  information 
is  stored  in  binary. 
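
The record layout just described can be read with a short routine such as the sketch below. Several details are assumptions made for the illustration, since the text does not state them: the four byte length fields are treated as big-endian unsigned integers, and each length is assumed to count its own four byte header (controlled by the hypothetical flag lengths_include_header). The function name is also hypothetical.

```python
import struct


def iter_records(data, lengths_include_header=True):
    """Yield record payloads from blocked, variable-length binary data.

    ASSUMPTIONS (not stated in the text): length fields are 4-byte
    big-endian unsigned integers; by default each length includes the
    4 bytes of its own header.
    """
    pos = 0
    while pos < len(data):
        (block_len,) = struct.unpack_from(">I", data, pos)
        block_end = pos + block_len if lengths_include_header else pos + 4 + block_len
        pos += 4  # step over the block header
        while pos < block_end:
            (rec_len,) = struct.unpack_from(">I", data, pos)
            payload_len = rec_len - 4 if lengths_include_header else rec_len
            if payload_len < 0:
                raise ValueError(f"implausible record length {rec_len} at offset {pos}")
            yield data[pos + 4 : pos + 4 + payload_len]
            pos += 4 + payload_len
```

The length fields must be read from the raw bytes. If any character conversion has been applied to the file beforehand, these binary fields no longer hold the values that were written.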

Fields  in  the  data  extract  are  copied  directly  from  the 
claims  system,  so  the  extracted  data  contains  a  variety 
of  data  types,  including  a  numeric  format  called  “packed 
decimal”  which  will  be  described  later.  Furthermore,  no 
conversion  is  applied  to  the  extracted  data,  so  character 
data  is  stored  using  EBCDIC  encoding. 

On the information services company side, the UNIX
and PC systems use ASCII encoding. The systems administrator
reads the tape dataset directly onto the hard
disk of a UNIX workstation. The blocks are converted
from EBCDIC to ASCII before writing to disk. The systems
administrator reasons that data from an EBCDIC
system must be converted to ASCII for the ASCII-based
UNIX system.

The statistician obtains a 66,077,850 byte file (approximately
63 MB) with the above information, a detailed
record layout, and block size B = 32000. She reads the
first  four  bytes  using  a  positive  integer  binary  format 
and  obtains  the  value  16,459,  which  she  interprets  to  be 
the  size  of  the  first  block.  The  documentation  implies 
that  record  sizes  are  one  of  260,  270,  280,  or  290  bytes. 
An  odd  block  size  is  not  possible  even  with  the  four  byte 
block  header. 

The  first  20  fields  on  all  records  are  the  same,  and 
the  fifth  field  is  a  four  byte  packed  decimal  field.  Her 
attempts  to  read  the  fifth  field  of  the  first  record  as 
a  packed  decimal  field  fail.  She  gets  the  hex  value 
0003A61Ch.  The  statistician  is  now  convinced  that  the 
data  is  corrupt. 

In  this  example,  the  statistician  received  the  data  in¬ 
directly  from  a  DP  staff  member.  She  was  informed  by 
the  person  who  transferred  the  data  from  tape  to  disk, 
“I  read  the  data  and  converted  it  using  the  system  dd 
command  with  conversion  equal  ASCII  turned  on.”  At 
the  time,  it  made  sense  to  convert  IBM  370  data  created 
on  an  EBCDIC  system  to  the  standard  ASCII  encoding 
used  on  UNIX  workstations.  This  seemingly  insignificant 
conversion  exercise  effectively  encoded  the  data  into  an 
unusable  but  salvageable  form.  Had  the  data  remained 
in  “EBCDIC”  format,  the  block  header  would  have  been 
read  as  a  realistic  value  of  31954,  and  the  packed  decimal 
field  would  have  produced  0003471Ch,  or  3,471. 
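
The damage to the block header can be reproduced in a few lines. The following is a minimal sketch in Python (a language this paper does not use). It assumes Python's cp037 codec as a stand-in for the dd translation table. The two tables agree on printable characters but may differ elsewhere, so only the block header is shown and the packed decimal result is not reproduced.

```python
# Assumed stand-in for `dd conv=ascii`: decode each byte as EBCDIC (cp037),
# then write the resulting character back as a single Latin-1 byte.
def ebcdic_to_ascii(raw: bytes) -> bytes:
    return raw.decode("cp037").encode("latin-1")

header = (31954).to_bytes(4, "big")   # true block header: 00 00 7C D2
damaged = ebcdic_to_ascii(header)     # 7C ('@') -> 40, D2 ('K') -> 4B

print(header.hex(), int.from_bytes(header, "big"))
print(damaged.hex(), int.from_bytes(damaged, "big"))
```

The second line printed shows the value the statistician obtained, because the two low-order bytes of the header happen to be the EBCDIC codes for printable characters.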

With  the  rise  of  client/server  computing  and  networks 
of  computers  exchanging  and  sharing  information,  statis¬ 
ticians  must  be  aware  of  the  potential  pitfalls  inherent 
in  the  transfer  and  conversion  of  data.  This  paper  ad¬ 
dresses  some  of  the  pitfalls  and  provides  guidance  for 
transferring  data  and  diagnosing  problems. 

2.  EBCDIC  AND  ASCII  ENCODING 

The ASCII (American National Standard Code for Information
Interchange) conversion standard associates a
one byte binary integer value with an interpreted ‘character’,
which can be: (1) a printable character, referred
to as an ASCII character, (2) a communication or ‘handshaking’
code, or (3) an ‘escape’ or ‘control’ character.
ASCII decimal values 32 through 126 (hex 20h through
7Eh) are represented by standard keyboard symbols and
are often referred to as ‘printable characters’. These include
symbols like !, <, 9, A, a, and ]. If Shift-A is
typed, ‘A’ is displayed, and the decimal value 65 (hex
41h) is placed in the keyboard input queue. ASCII decimal
values 0 through 31 (hex 0h through 1Fh) have
meanings that may vary depending on the software used
or the device that is sending or receiving the values. For
example, the linefeed character is ASCII decimal 10 (hex
Ah), form feed is ASCII decimal 12 (hex Ch), and carriage
return is ASCII decimal 13 (hex Dh). ASCII decimal
3 is the familiar Control-C. ASCII decimal values
128 through 255 may have standard meanings, but many
software products do not recognize the standard values
for ‘extended ASCII’.

The  EBCDIC  (Extended  Binary-Coded  Decimal  In¬ 
terchange  Code)  conversion  standard  performs  the  same 
function  as  ASCII,  except  EBCDIC  has  a  richer  set  of 
printable  characters,  including  characters  /!  and 

ASCII  and  EBCDIC  imply  a  sort  order  for  symbols, 
and  the  order  is  different.  For  example,  EBCDIC  sort 
order  places  b  (decimal  130)  before  B  (decimal  194), 
whereas  ASCII  sort  order  places  B  (decimal  66)  before 
b  (decimal  98). 
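
The difference in sort order is easy to check. The sketch below is in Python and assumes the cp037 codec as a representative EBCDIC code page; other EBCDIC code pages exist.

```python
symbols = ["b", "B", "a", "A", "1"]

# Sort by the byte value each symbol occupies under each encoding.
ascii_order = sorted(symbols, key=lambda s: s.encode("ascii"))
ebcdic_order = sorted(symbols, key=lambda s: s.encode("cp037"))

print(ascii_order)    # digits, then upper case, then lower case
print(ebcdic_order)   # lower case, then upper case, then digits
print("b".encode("cp037")[0], "B".encode("cp037")[0])
```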

If  data  is  stored  as  EBCDIC,  it  may  be  converted  to 
ASCII,  although  conversions  for  extended  ASCII  or  for 
non-printable  characters  may  be  dependent  on  the  ap¬ 
plication.  The  UNIX  dd  command  employs  a  one-to- 
one  mapping  between  the  two  encodings.  Some  software 
products  do  not  provide  a  one-to-one  mapping  for  ASCII 
codes  above  127.  For  non-standard  encodings,  such  as 
ASCII  128  or  EBCDIC  112,  conversion  from  one  encod¬ 
ing  to  another  is  risky,  and  data  with  non-standard  en¬ 
codings  should  be  investigated.  Most  data  represents 
printable  characters  or  numeric  values.  Character  data 
with  non-standard  encoding  should  be  suspect. 

When  data  is  not  encoded  as  pure  ASCII  or  pure 
EBCDIC,  then  ASCII  or  EBCDIC  conversions  must  be 
avoided.  Packed  decimal  is  neither  ASCII  nor  EBCDIC. 
Numeric  formats  are  more  hardware  dependent  than  are 
character  formats  like  ASCII  and  EBCDIC.  Two  UNIX 
workstations  may  use  ASCII  encoding,  but  may  use  to¬ 
tally  different  binary  or  floating  point  storage.  The  mis¬ 
take  that  computer  novices  make  is  to  assume  an  entire 
dataset  is  ASCII  or  EBCDIC,  when  in  fact  only  certain 
data  fields  within  records  are  ASCII  or  EBCDIC. 


3.  NUMERIC  FORMATS 

The  two  most  common  numeric  formats  are  integer  bi¬ 
nary  and  floating  point  binary.  While  some  systems  may 
store  integer  values  in  reverse  bit  order  or  vary  where 
the  sign  bit  is  located,  many  systems  use  a  direct  binary 
translation  of  integer  values.  Thus,  the  rightmost  bit 
represents  units,  the  next  bit  two’s,  the  next  bit  four’s, 
etc. 

Floating  point  storage  is  another  matter,  with  system 
370  format  and  IEEE  format  dominating.  Single  preci¬ 
sion  and  double  precision  are  the  two  standard  floating 
point  storage  sizes.  All  floating  point  storage  modes  in¬ 
volve  a  mantissa  that  records  all  significant  digits,  and 
an  exponent  which  defines  the  location  of  the  decimal 
point.  Floating  point  formats  differ  in:  (1)  the  size  of 
the  mantissa  and  exponent,  and  (2)  the  method  of  stor¬ 
ing  the  numeric  information. 

There  are  4,278,190,592  valid  floating  point  values  us¬ 
ing  the  IEEE  four  byte  (single  precision)  floating  point 
storage  format.  There  are  16,776,704  ‘invalid’  floating 
point  values  using  the  same  IEEE  storage  format.  These 
invalid  values  may  actually  be  interpreted  as  ‘infinity’  or 
‘not-a-number’  (NAN)  values.  If  you  randomly  gener¬ 
ate  four  byte  values  using  a  uniform  generator,  approx¬ 
imately  99.61%  of  the  values  should  be  valid  floating 
point  representations. 
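
The 99.61% figure can be checked directly. The Python sketch below treats a four byte word as invalid when it decodes to infinity or NaN, which is an assumption about what ‘invalid’ means here. Validity depends only on the exponent bits, which lie in the two high-order bytes, so enumerating those two bytes is sufficient.

```python
import math
import struct

def is_valid_float32(word: bytes) -> bool:
    """True unless the 4 bytes decode to infinity or NaN (exponent bits all 1)."""
    (value,) = struct.unpack(">f", word)
    return math.isfinite(value)

# Fix the two low-order bytes and enumerate the two high-order bytes.
valid = sum(
    is_valid_float32(bytes([b0, b1, 0, 1]))
    for b0 in range(256)
    for b1 in range(256)
)
print(f"{valid / 65536:.2%}")
```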

Since  floating  point  is  so  dependent  on  hardware  plat¬ 
form,  it  is  rarely  a  good  storage  format  for  transferring 
data  across  systems.  For  this  reason,  numeric  storage 
modes  like  packed  decimal  and  zoned  decimal  are  popu¬ 
lar. 

Packed  decimal  interprets  the  hex  representation  of  a 
number  as  a  decimal  number.  The  last  half-byte  is  hex  C 
for  positive  values  and  hex  D  for  negative  values.  Thus, 
973Ch  is  positive  973,  and  450Dh  is  negative  450. 
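
A packed decimal field can be decoded by hand when no software tool supports the format. The following Python function is a minimal sketch rather than a complete implementation. The function name is hypothetical, and it accepts only the sign codes C and D described above, although some systems accept others.

```python
def unpack_packed_decimal(field: bytes) -> int:
    """Decode packed decimal: two digits per byte, sign in the last half-byte."""
    nibbles = [n for byte in field for n in (byte >> 4, byte & 0x0F)]
    *digits, sign = nibbles
    if any(d > 9 for d in digits):
        raise ValueError(f"invalid digit in {field.hex().upper()}")
    if sign not in (0xC, 0xD):
        raise ValueError(f"invalid sign in {field.hex().upper()}")
    value = int("".join(map(str, digits)))
    return -value if sign == 0xD else value

print(unpack_packed_decimal(bytes.fromhex("973C")))
print(unpack_packed_decimal(bytes.fromhex("450D")))
print(unpack_packed_decimal(bytes.fromhex("0003471C")))
```

Applied to the damaged field 0003A61Ch from the introduction, the function raises an error, because A is not a decimal digit.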

Zoned decimal uses a single byte for each numeric
digit. Fixed point storage is assumed with an implied
decimal. The sign, plus or minus, is stored in the last
byte, and the nature of the sign byte depends on whether
the system is ASCII or EBCDIC. If the rightmost digit
is 0, and the number is positive, the rightmost byte will
have value ‘{’ (EBCDIC hex C0) or ‘0’ (ASCII hex 30). If
the rightmost digit is 0, and the number is negative, the
rightmost byte will have value ‘}’ (EBCDIC hex D0) or
‘p’ (ASCII hex 70). Positive sign bytes progress from hex
C0 to hex C9 on EBCDIC systems, and from hex 30 to
hex 39 on ASCII systems. Negative sign bytes progress
from hex D0 to hex D9 on EBCDIC systems, and from
hex 70 to hex 79 on ASCII systems. The rightmost hex
digit is the value of the last digit in the number.
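
The EBCDIC sign-byte rule can be sketched in Python as follows. The function name is hypothetical. The sketch handles only the EBCDIC ranges given above and ignores the implied decimal; unsigned fields whose last byte is an ordinary digit are deliberately not handled.

```python
def unpack_zoned_ebcdic(field: bytes) -> int:
    """Decode EBCDIC zoned decimal using sign bytes C0-C9 and D0-D9."""
    *body, last = field
    if any(not 0xF0 <= b <= 0xF9 for b in body):
        raise ValueError("non-digit byte in zoned decimal field")
    zone, digit = last >> 4, last & 0x0F
    if zone not in (0xC, 0xD) or digit > 9:
        raise ValueError("invalid sign byte in zoned decimal field")
    digits = [b & 0x0F for b in body] + [digit]
    value = int("".join(map(str, digits)))
    return -value if zone == 0xD else value

print(unpack_zoned_ebcdic(bytes.fromhex("F1F2C3")))   # last byte C3: positive, digit 3
print(unpack_zoned_ebcdic(bytes.fromhex("F4F5D0")))   # last byte D0: negative, digit 0
```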

The  following  table  presents  statistics  on  an  arbitrary 
set  of  text  files  read  in  as  4  byte  floating  point  words 
using IEEE single precision floating point storage.


Text Files Read as Floats, 4 Byte Words

Source           Mean         Minimum       Maximum
IEEE value       3.047e296    2.646e-260    1.367e301
IEEE exponent    161.85       -260          301
S370 value       7.475e70     9.658e-67     2.057e74
S370 exponent    39.17        -67           74

The  unusual  magnitude  of  the  values  should  be  suffi¬ 
cient  to  indicate  that  the  wrong  conversion  format  was 
employed.  Large  ranges  imply  that  the  floating  point 
format  is  invalid. 
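
The effect is easy to reproduce on a small scale. The Python sketch below reads an arbitrary string of ASCII text, invented for illustration, as big-endian IEEE four byte floats. Every word decodes without error, yet the values span dozens of orders of magnitude.

```python
import struct

text = b"JONES, ULYSSUS P.   0215 and other ordinary text"

# Cut the text into 4-byte words and decode each as an IEEE single.
usable = len(text) - len(text) % 4
words = [text[i:i + 4] for i in range(0, usable, 4)]
values = [struct.unpack(">f", w)[0] for w in words]

print(f"min={min(values):.3e} max={max(values):.3e}")
```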

4.  DATA  CONVERSION  EXAMPLES 

The  conversion  of  data  written  on  one  computer  to 
a  form  acceptable  to  another  computer  may  appear  to 
be  straightforward,  but  problems  may  arise.  If  a  record 
from  a  dataset  contains  non-numeric  data  in  some  fields 
and  numeric  data  stored  in  binary  or  some  other  form 
in  other  fields,  then  a  simple  conversion  of  the  entire 
record  is  not  possible.  A  common  situation  that  is  of¬ 
ten  encountered  is  the  conversion  of  EBCDIC  to  ASCII. 
For  such  situations,  the  statistician  must  distinguish  be¬ 
tween  fields  that  contain  EBCDIC  values  and  fields  that 
contain  numeric  values.  A  common  misconception  is 
that  numeric  values  expanded  into  EBCDIC  symbols  are 
stored  as  numeric  values.  An  example  serves  to  illustrate 
this problem. In the following, the symbol @ is used to
represent an unprintable character.


Record Layout:
Field     Columns   Format    Value
NAME      1-20      EBCDIC    JONES, ULYSSUS P.
AMOUNT    21-24     EBCDIC    $215
LOSS      25-28     Binary    $19,608

                            1    1    2    2
Column Number:     ----5----0----5----0----5---
                   DDDCE64EDEEEEE4D4444FFFF0049
Record in Hex:     16552B0438224207B000021500C8
Record in EBCDIC:  JONES, ULYSSUS P.   0215@@<q

Note  that  the  display  stacks  hex  digits  so  that  the 
relationship  between  what  is  seen  and  the  corresponding 
hex  representation  is  clear.  For  example,  the  character 
J is hex D1, O is hex D6, etc. The digits 0 through 9
are represented in EBCDIC hex as F0 through F9.

In  the  example,  AMOUNT  is  stored  in  EBCDIC 
rather  than  integer  binary.  This  is  wasteful,  because 
amounts  between  $0  and  $10,000  may  be  stored  in  bi¬ 
nary  using  two  bytes  rather  than  four  bytes,  because 

9999(decimal) = 0010 0111 0000 1111(binary).


Nonetheless,  it  is  convenient  to  store  data  in  readable 
form  when  small  numeric  quantities  are  involved.  The 
field  LOSS  is  stored  in  integer  binary  using  four  bytes, 
which  permits  dollar  amounts  up  to  $2,147,483,647.  Had 
EBCDIC  been  used,  ten  bytes  would  have  been  required. 
For  values  up  to  $8,388,607,  only  three  bytes  are  re¬ 
quired. 

For  the  example,  note  that 

19,608(decimal) = 4C98(hex)
                = 0100 1100 1001 1000(binary).

If  the  four  byte  LOSS  field  was  interpreted  as  EBCDIC, 
then  the  value  would  be  treated  as  invalid.  The  repre¬ 
sentation  “@@<q”  cannot  be  interpreted  as  a  number. 
On  the  other  hand,  if  the  four  byte  AMOUNT  field  had 
been  interpreted  as  integer  binary,  the  amount  would 
have  been  read  as  $4,042,453,493.  The  lesson  is  that 
any  data  field  interpreted  as  integer  binary  will  produce 
valid  numeric  values,  but  EBCDIC  fields  may  contain 
values  that  cannot  be  interpreted  as  numeric. 
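
The two interpretations of each field can be compared directly. The Python sketch below rebuilds the example record from its hex dump and reads AMOUNT and LOSS both ways. The variable names are illustrative, and the cp037 codec is assumed as the EBCDIC code page.

```python
record = bytes.fromhex(
    "D1D6D5C5E26B40E4D3E8E2E2E4E240D74B404040"   # NAME, columns 1-20
    "F0F2F1F5"                                   # AMOUNT, columns 21-24
    "00004C98"                                   # LOSS, columns 25-28
)
name_raw, amount_raw, loss_raw = record[:20], record[20:24], record[24:28]

name = name_raw.decode("cp037").rstrip()
amount = int(amount_raw.decode("cp037"))         # EBCDIC digits
loss = int.from_bytes(loss_raw, "big")           # integer binary

# Swap the interpretations: binary always yields a number, EBCDIC may not.
amount_as_binary = int.from_bytes(amount_raw, "big")
try:
    loss_as_text = int(loss_raw.decode("cp037"))
except ValueError:
    loss_as_text = None                          # not interpretable as a number

print(name, amount, loss, amount_as_binary, loss_as_text)
```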

The  most  serious  consequence  of  poor  conversion 
is  that  integer  binary  fields  may  be  converted  from 
EBCDIC  to  ASCII,  thereby  producing  bogus  results. 
For  example,  converting  the  above  example  record  to 
ASCII  produces 

                            1    1    2    2
Column Number:     ----5----0----5----0----5---
                   4444522545665525222233330037
Record in Hex:     AFE53C05C9335300E000021500C1
Record in ASCII:   JONES, ULYSSUS P.   0215@@<q

Note  that  the  printable  characters  do  not  change,  but 
since  the  LOSS  field  was  not  meant  to  represent  printable 
symbols,  the  numeric  value  of  LOSS  has  changed.  In  this 
case,  LOSS  takes  the  value 

3C71(hex)  =  $15,473(decimal). 
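
The contrast between converting the whole record and converting only its character fields can be sketched in Python. The helper name is hypothetical, and cp037 is again assumed as a stand-in for the byte-for-byte translation table.

```python
record = bytes.fromhex(
    "D1D6D5C5E26B40E4D3E8E2E2E4E240D74B404040F0F2F1F500004C98"
)

def to_ascii(raw: bytes) -> bytes:
    # Assumed one-to-one EBCDIC-to-ASCII table (Python's cp037 codec).
    return raw.decode("cp037").encode("latin-1")

# Wrong: translate all 28 bytes, including the binary LOSS field.
blind = to_ascii(record)
# Field-aware: translate columns 1-24 only and copy LOSS unchanged.
careful = to_ascii(record[:24]) + record[24:]

print(blind[:24].decode("ascii"))
print(int.from_bytes(blind[24:], "big"), int.from_bytes(careful[24:], "big"))
```

Both versions display the same name and amount; only the value read from LOSS reveals which conversion was applied.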

For  many  situations,  descriptive  statistics  will  reveal  the 
conversion  mistake,  but  as  this  example  illustrates,  situ¬ 
ations  exist  where  the  mistake  would  not  be  obvious.  A 
small  simulation  reveals  that  the  problem  can  be  serious 
when  numeric  values  are  relatively  small.  The  follow¬ 
ing  table  contains  results  for  a  simulated  dataset  having 
500  observations  taken  from  a  Gamma  distribution  with 
shape  parameter  2  and  scale  parameter  100.  Values  were 
written  to  disk  as  four  byte  integer  binary  values.  The 
column  headed  “ASCII  Value”  represents  the  data  read 
in  after  using  the  UNIX  dd  command  EBCDIC-to-ASCII 
conversion  feature,  and  the  column  headed  “EBCDIC 
Value”  represents  the  data  read  in  after  using  the  dd 
command  ASCII-to-EBCDIC  conversion  feature. 
Statistic    Original Value    ASCII Value    EBCDIC Value
No. Obs.            500.00        500.000         500.000
Mean                197.51        207.242         227.598
Median              159.50        180.000         208.000
Std Dev             141.78        143.748         142.722
Min                   6.00          4.000           5.000
Max                1013.00        897.000         876.000
Skewness              1.40          1.121           1.143
Kurtosis              3.13          1.761           1.682
5%-ile               36.50         26.000          52.500
10%-ile              52.50         42.000          79.500
25%-ile              94.50        103.000         117.000
75%-ile             275.50        274.000         293.500
90%-ile             391.50        421.500         461.000
95%-ile             472.50        461.500         496.000
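
A simulation of this kind can be sketched in Python. This is not the code used to produce the table. The seed is arbitrary, and the cp037 codec is assumed in place of the dd translation tables, so the statistics will not match the table exactly; the point is only that the converted values remain plausible.

```python
import random
import statistics

random.seed(1995)   # arbitrary seed, used only for repeatability
original = [round(random.gammavariate(2, 100)) for _ in range(500)]

def reread(values, decode_as, encode_as):
    """Write 4-byte big-endian integers, translate every byte, read back."""
    result = []
    for v in values:
        raw = v.to_bytes(4, "big")
        result.append(int.from_bytes(raw.decode(decode_as).encode(encode_as), "big"))
    return result

ascii_values = reread(original, "cp037", "latin-1")    # EBCDIC-to-ASCII
ebcdic_values = reread(original, "latin-1", "cp037")   # ASCII-to-EBCDIC

for label, data in [("Original", original), ("ASCII", ascii_values),
                    ("EBCDIC", ebcdic_values)]:
    print(f"{label:9s} mean={statistics.mean(data):8.2f} "
          f"median={statistics.median(data):7.1f} max={max(data)}")
```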

Many  programmers  who  have  used  the  built-in  fea¬ 
tures  of  programming  languages  are  not  aware  of  the  im¬ 
plications  of  performing  arithmetic  operations  or  storing 
numeric  values.  In  fact,  many  databases  store  numeric 
values  in  human  readable  form.  The  interaction  with 
“user-friendly” software or hardware can lead to problems
or misconceptions.

A  subtle  data  conversion  problem  occurs  when  data 
fields  contain  codes  that  may  have  special  meaning  to 
data  processing  software.  For  example,  if  a  software 
product is expecting to find a record delimiter, like a linefeed
(ASCII hex 0A) or carriage return (ASCII hex 0D),
the  software  may  encounter  the  delimiter  in  a  packed 
decimal,  integer  binary,  or  floating  point  field.  If  a  record 
delimiter  is  encountered  within  a  field,  the  software  may 
assume  that  the  end  of  the  record  has  been  reached 
prematurely  and  truncate  the  record.  This  occurs  be¬ 
cause  the  algorithm  reads  a  record  into  a  buffer  up  to 
the  record  delimiter  before  it  attempts  to  interpret  the 
individual  fields  in  the  record.  It  may  be  necessary  to 
specify  record  sizes  rather  than  depend  on  record  delim¬ 
iters  when  special  numeric  storage  modes  are  used. 
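
The truncation can be demonstrated with a small invented file. In the Python sketch below, the record layout (a six byte name followed by a four byte binary integer and a linefeed) is hypothetical and chosen only so that one binary field contains the byte 0A.

```python
import struct

RECORD = struct.Struct(">6sI")   # 6-byte name, 4-byte big-endian integer
data = RECORD.pack(b"SMITH ", 10) + b"\n" + RECORD.pack(b"JONES ", 300) + b"\n"

# Delimiter-based reading: the value 10 is stored as 00 00 00 0A, and its
# last byte is taken for a linefeed, so the first record is cut short.
by_delimiter = data.split(b"\n")

# Length-based reading: step through the file one full record at a time.
step = RECORD.size + 1
by_length = [RECORD.unpack(data[i:i + RECORD.size])
             for i in range(0, len(data), step)]

print(by_delimiter)
print(by_length)
```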

When  file  transfer  occurs  using  FTP  or  some  other 
transfer  mechanism,  there  are  options  for  binary  or  text 
transfer.  A  fixed  length  record  data  set  with  numeric 
fields  as  described  in  the  previous  paragraph  may  be 
transferred  as  a  text  file.  In  particular,  suppose  the  file 
is transferred from a DOS based system to a UNIX system.
DOS terminates text records with a carriage return
and linefeed, hex 0D0A. FTP replaces 0D0A with
0A, and removes the file terminating 1A (Control-Z)
which DOS uses as an end-of-file marker. A binary field,
say 0D0A(hex)=3338(decimal), would be left shifted
one byte and replaced with 0A(hex)=10(decimal). The
record  containing  this  value  would  be  corrupted.  This 
explains why data files are almost always transferred as
binary files.
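
The corruption caused by a text transfer can be imitated in Python. The function below is a simplified model of a DOS-to-UNIX text transfer, not the behavior of any particular FTP client, and the three field values are chosen for illustration.

```python
import struct

def text_mode_transfer(payload: bytes) -> bytes:
    """Simplified model: CR LF becomes LF, and a trailing Control-Z is dropped."""
    return payload.replace(b"\r\n", b"\n").rstrip(b"\x1a")

fields = (3338, 215, 19608)              # 3338 is 0D0A in hex
sent = struct.pack(">3H", *fields)       # three 2-byte binary fields
received = text_mode_transfer(sent)

print(sent.hex(), len(sent))
print(received.hex(), len(received))
```

The received data is one byte shorter than the data sent, so every field after the damaged one is misaligned.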

5.  DIAGNOSING  BAD  DATA 

The  most  obvious  approach  to  diagnose  bad  data  is 
to  flag  bad  values  and  generate  a  frequency  table  for  the 
flag.  This  approach  may  be  adequate  for  packed  decimal, 
zoned  decimal,  or  character  (printable)  data,  but  binary 
fields  will  always  be  valid,  and  floating  point  fields  will 
almost  always  appear  to  be  valid. 
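
The flagging approach can be sketched in Python for packed decimal fields. The function name and the sample field values are illustrative, and only the sign codes C and D are accepted.

```python
from collections import Counter

def packed_decimal_ok(field: bytes) -> bool:
    """Flag check: all leading half-bytes are digits and the sign is C or D."""
    hex_digits = field.hex()
    return hex_digits[:-1].isdigit() and hex_digits[-1] in "cd"

fields = [bytes.fromhex(h)
          for h in ("0003471C", "0003A61C", "0000450D", "4040404B")]
flags = Counter("valid" if packed_decimal_ok(f) else "invalid" for f in fields)

print(flags)
# By contrast, every field is "valid" when read as integer binary.
print([int.from_bytes(f, "big") for f in fields])
```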

For  numeric  storage  modes  like  integer  binary  and 
floating  point  binary,  statistical  summaries  may  be  ad¬ 
equate.  The  source  data  set  should  be  analyzed  to  de¬ 
rive  means  and  percentiles.  If  the  target  dataset  pro¬ 
vides  statistics  that  compare  to  within  expected  round¬ 
off,  then  it  is  unlikely  that  conversion  problems  occurred. 
Some  data  sources  will  not  have  expertise  or  resources 
to  calculate  percentiles,  so  one  solution  is  to  request  a 
sum  for  each  numeric  field  and  a  sample  dump  of  10  to 
20  records,  preferably  in  hex  and  human  readable  form. 

What  about  data  that  comes  from  an  “unfriendly” 
source?  This  may  occur  when  a  source  is  a  secondhand 
distributor  of,  say,  government  data.  The  source  may 
only  be  set  up  to  distribute  copies  and  may  have  no 
software  tools  to  validate  data.  For  these  situations, 
the  statistician  should  have  some  idea  of  what  to  ex¬ 
pect  from  the  numeric  fields.  A  binary  field  will  always 
produce  valid  results  even  if  non-binary  data  is  stored  in 
the  field,  and  a  floating  point  field  will  appear  to  have 
valid  values  over  99%  of  the  time  even  if  the  field  actually 
contains  binary  or  character  data.  On  the  other  hand, 
packed  decimal  and  zoned  decimal  fields  that  display  in¬ 
valid  numeric  values  are  probably  specified  incorrectly. 

6.  MISCELLANEOUS  CONVERSION 
ISSUES 

When  the  person  writing  the  program  to  create  a  tape 
dataset  has  no  direct  interest  in  the  statistical  analysis 
that  is  to  be  performed,  then  communication  problems 
may lead to data conversion problems. Following is a
list  of  problems  that  arise  from  the  source  of  the  tape 
dataset. 

• The programmer uses a system cataloged procedure
without understanding the defaults that are employed.

• The programmer uses a proprietary storage mode
supported only by a given commercial software
product that may not be available on the system
reading the data.

• The programmer anticipates EBCDIC to ASCII
conversion and performs the conversion at the
source site without considering the impact on
packed decimal fields or other non-EBCDIC fields.

• The programmer uses hardware specific storage
modes, such as floating point and binary, and lacks
the sophistication to define how these fields are interpreted
on the source system.

Single  purpose  programmers,  such  as  COBOL  pro¬ 
grammers  performing  systems  analysis,  may  work  for 
years  without  understanding  how  packed  decimal  fields 
are  actually  deciphered  by  the  software.  These  program¬ 
mers  may  also  use  tape  utilities  without  understanding 
how  they  work.  Tape  reading  and  writing  for  archiving 
and  other  purposes  is  routine,  but  sending  tapes  off  site 
may  be  rare.  Many  companies  have  been  collecting  data 
for  years,  but  are  just  now  beginning  to  realize  the  value 
of  data.  These  companies  are  likely  to  be  the  greatest 
source  of  data  conversion  problems. 

While  only  a  few  improper  conversions  are  difficult  to 
detect,  a  number  of  improper  conversions  may  be  diffi¬ 
cult  to  diagnose.  Integer  binary  being  converted  using 
EBCDIC-to-ASCII  conversion  tables  poses  a  dangerous 
problem  that  is  difficult  to  detect.  Trying  to  read  float¬ 
ing  point  data  as  packed  decimal  readily  reveals  a  prob¬ 
lem,  but  it  may  take  some  detective  work  to  deduce  the 
correct  numeric  format  that  should  have  been  used. 

7.  CONCLUDING  REMARKS 

The  number  of  seemingly  sophisticated  computer 
users  who  thought  EBCDIC  to  ASCII  conversion  should 
be  done  for  an  entire  dataset  rather  than  on  each  indi¬ 
vidual  field  of  a  record  was  surprising.  In  a  non-random 
sampling  of  associates,  over  90%  of  those  queried  thought 
that  data  from  an  IBM  system  should  be  transferred 
to  a  UNIX  system  using  EBCDIC-to-ASCII  conversion. 
Rather  than  reflecting  a  basic  ignorance  of  computer 
data  storage,  this  finding  probably  reflects  that  many 
statisticians only use pure ‘text’ storage mode for data,
which  means  that  pure  ASCII  or  EBCDIC  is  used. 

The  following  guidelines  for  the  source  of  the  data  will 
help  make  data  transfer  and  conversion  relatively  pain¬ 
less. 

1.  If  dataset  size  is  not  an  issue,  use  pure  EBCDIC  or 
pure  ASCII  to  store  data. 

2.  Avoid  using  floating  point  storage  mode  on  any  data 
that  is  to  be  transferred  to  another  computer  sys¬ 
tem. 


3.  If  system  dependent  storage  modes  like  packed  dec¬ 
imal  or  floating  point  storage  are  to  be  employed, 
include  a  complete  description  of  the  format  with 
the  dataset  documentation.  Complete  documenta¬ 
tion  would  allow  the  recipient  to  program  a  conver¬ 
sion  algorithm  from  scratch  if  software  tools  did  not 
exist  that  supported  the  format. 

4.  Include  statistical  summaries  for  all  numeric  fields, 
and  frequencies  for  categorical  fields,  along  with  a 
dump  of  some  representative  records  to  facilitate 
validation on the recipient’s end.

Software  products  like  the  SAS  System®  provide 
tools  and  formats  that  make  data  transfer  and  conversion 
relatively  painless.  Base  SAS  software  provides  a  rich 
collection  of  data  conversion  formats,  including  hardware 
specific  formats  such  as  IBM  370  and  VAX  floating  point 
formats.  The  expository  articles  by  Langston  (1987)  and 
Klenz  (1992)  provide  insight  into  issues  related  to  float¬ 
ing  point  storage  of  data.  Kudlick  (1980)  is  an  older  text¬ 
book  with  details  about  IBM  System  370  numeric  stor¬ 
age  modes,  such  as  packed  decimal  and  floating  point. 

In  the  analysis  of  data  obtained  from  a  “foreign”  com¬ 
puter  source,  the  first  step  in  any  statistical  analysis 
should  be  to  verify  the  validity  of  the  data.  The  statisti¬ 
cian  who  proceeds  with  an  analysis  without  having  con¬ 
firmed  the  proper  transfer  and  conversion  of  data  is  as 
foolish  as  the  scientist  who  comes  to  a  statistician  for 
help  only  after  the  data  has  been  collected.  Academi¬ 
cians  who  rely  on  graduate  students  or  data  center  per¬ 
sonnel  should  be  particularly  cautious,  especially  if  their 
own  computer  skills  are  weak. 

The SAS System is a registered trademark of SAS Institute
Inc., Cary, North Carolina.
Abstract

We  describe  a  set  of  object-oriented  S  functions  that 
harness  the  automatic  printing  facility  in  S  to  convert  an 
S  object  to  another  format.  We  devise  new  classes  and 
subclasses  of  objects  in  S:  file  (with  subclasses  source,  sas, 
dvi,  latex,  ps,  and  xli),  display,  and  expr  (closely  related  to 
objects  of  mode  expression);  print  methods  (families  of 
related  functions  that  depend  on  the  class  of  the  argu¬ 
ment)  for  the  new  classes;  and  other  families  of  functions 
that  depend  on  environmental  variables.  We  give  exam¬ 
ples  for  displaying  an  S  object  in  its  own  window  on  the 
user’s workstation, for converting a data.frame to a LaTeX
table, for displaying S help files in a TeX window on a
workstation,  and  for  converting  a  data.frame  to  a  system 
file  for  another  software  system.  We  show  how  to  put 
a  SAS  program  inside  an  iterative  loop  controlled  by  S. 
We  give  applications  to  programming  and  debugging  in 
S.  We  discuss  design  issues  in  constructing  the  functions 
and  relating  the  individual  members  of  the  set  of  func¬ 
tions  to  each  other  and  to  the  object-oriented  paradigm 
in  the  underlying  S  program. 

KEY WORDS: S, Display software, LaTeX, Object-oriented
programming, SAS, System Interfaces.

1.  Introduction 

S  (Becker,  Chambers,  and  Wilks,  1988)  is  a  “Pro¬ 
gramming  Environment  for  Data  Analysis  and  Graph¬ 
ics”  originally  designed  for  Unix  computers.  S-Plus,  a 
supported  version  of  the  program,  including  a  port  to 
the  MS-DOS  environment,  is  available  from  Statistical 
Sciences,  Inc.  (1991). 

S  was  extended  by  Chambers  and  Hastie  (1992)  to  in¬ 
clude  an  object-oriented  programming  environment.  Ob¬ 
jects  in  the  environment  include  functions,  data,  and 
graphs.  Generic  function  names  in  the  language  are 
sensitive to the class of their arguments, and call different
methods for the actual execution. For example,
S normally prints quotation marks around character-valued
variables but suppresses the quotation marks for
character-valued  columns  in  a  data.frame. 

An  interactive  S  session  consists  of  two  types  of  state¬ 
ments:  assignments,  in  which  the  value  of  an  object  or 
the  result  of  a  function  call  is  assigned  to  another  object; 
and  automatic  print  statements,  in  which  values  or  re¬ 
sults  not  explicitly  assigned  to  an  object  are  printed  by 
an  implicit  call  to  a  print  method.  An  easy  way  to  con¬ 
trol  the  behavior  of  a  program  is  to  define  new  classes  of 
data  and  associated  print  functions. 

In  this  paper  we  introduce  three  new  classes  of  data, 
display,  file,  and  expr.  The  display  class  is  used  to  change 
the  destination  of  a  print  statement.  Under  normal  cir¬ 
cumstances,  printed  information  goes  to  the  standard 
output. Objects with class="display" are printed to the
"display"  device,  an  alternative  destination  defined  either 
in  an  environmental  variable  options()$display  or  as  an  ar¬ 
gument  to  an  explicit  call  to  the  print.display()  function. 
Typical  display  destinations  are  text  editors  in  indepen¬ 
dent  windows  on  a  display  screen,  printers,  or  pipes  into 
other  software  programs.  A  family  of  functions  display.* 
has  been  defined  to  be  sensitive  to  the  environmental 
variable  specifying  the  display  destination. 

The  mechanics  of  the  display  class  lead  to  the  second 
new  class  of  objects.  Objects  of  class  display  are  printed 
to  a  system  file  (using  the  sink  function)  in  the  under¬ 
lying  operating  system  (usually  Unix)  and  the  name  of 
the  file  is  returned  as  an  object  of  class  file.  Objects  of 
class="file" are printed by locating the operating system
file  and  printing  it  by  a  method  appropriate  to  its  sub¬ 
class.  There  are  various  subclasses  of  the  file  class:  latex 
files contain input to the LaTeX text processing software
(Lamport, 1986), dvi files (device-independent) contain
the results of processing by LaTeX, ps files contain information
in the page description language PostScript, source
files  contain  the  ascii  definitions  of  S  objects.  The  de¬ 
fault  print  method  for  a  file  object  is  to  display  the  ascii 
text  of  the  file  on  a  display  device.  The  print  methods  for 
latex,  dvi,  and  ps  use  the  unix()  function  to  call  operating 
system commands for each of these file types. The print
method  for  source  uses  the  S  command  source  to  bring 
revised  function  definitions  into  the  working  directory. 

The  third  new  class,  expr,  holds  statements  in  the  S 
language  of  mode  expression.  By  making  expressions  a 
class,  not  just  a  mode,  we  can  harness  the  automatic 
print  facilities  to  aid  in  debugging.  Since  we  are  chang¬ 
ing  the  behavior  of  the  print  method  for  many  objects, 
we  must  use  the  default  print  method  to  find  out  what 
objects  we  have  constructed.  By  defining  the  object  L 
and the print method print.expr:

> L <- expression(print.default(.Last.value))

> class(L) <- "expr"

>

> print.expr <- function(x) eval(x)

typing

> L

has the same effect as typing the longer statement

> print.default(.Last.value)

The  print  methods  are  implemented  as  a  family  of  func¬ 
tions  print.*.  The  print.display  method  is  implemented  by 
a  display.*  family  of  functions.  Examples  of  both  the  au¬ 
tomatic  and  explicit  use  of  print.display  are  in  Section  2. 
We  describe  the  design  of  the  functions  in  Section  3. 

The  mechanism  leads  to  a  very  general  interface  with 
other software systems, described and illustrated in Section
4. The statement print.display(x, display="latex") described
in Section 4.3 converts the S object x to a file
that can be input to the LaTeX typesetting system. The
statement  print.display(x,  display="sas")  described  in  Sec¬ 
tion  4.5  creates  a  SAS  dataset  (SAS  Institute  Inc.,  1990). 

2.  Display  of  Objects 

It  is  often  helpful  to  have  an  image  of  the  data  visible 
in  a  separate  window  from  the  one  in  which  the  S  session 
is  itself  running.  We  provide  a  set  of  functions  to  display 
text  images  in  their  own  windows.  We  assume  we  are 
working with an X-window workstation and tell S which
display device to use with the options(display="xedit")
statement at the beginning of the S session:

> options(display="xedit") # X using xedit

We  know  that  we  will  be  working  with  the  S  object 
my.data.frame and decide to assign it the "display" class:

> class(my.data.frame) <- c("display", class(my.data.frame))


Note  that  we  have  placed  the  new  class  first,  so  the  au¬ 
tomatic  print  mechanism  will  find  it  first,  and  retained 
all  previous  classes.  We  can  now  print  the  data.frame  to 
the  display  just  by  typing  its  name: 

>  my.data.frame 

The  automatic  print  mechanism  recognizes  this  to  be  an 
object  of  class  display  and  sends  it  to  the  print  method 
print.display.  Subscripting  retains  the  class  of  the  object. 

3.  Function  Design 

The  function  print.display  has  been  designed  as  a  print 
method for objects of class="display". Any object
for which class(object)[1] == "display" is automatically
printed  with  the  print.display  function.  Any  other  S  ob¬ 
ject  can  be  forced  to  print  on  the  display  by  explicitly 
using  print.display.  The  function  takes  additional  argu¬ 
ments  of  two  types.  First,  it  takes  general  arguments 
(width=  and  length=)  to  prevent  folding  of  long  lines,  and 
otherwise  take  advantage  of  scroll  bars  in  the  displayed 
window.  Second,  it  takes  device-specific  arguments  that 
allow user control of fonts and/or pagination on the display
device (X.flags=, lpr.flags=, lp.flags=, pr.flags=). We
have  provided  specific  functions  in  the  display.*  family  of 
function  names  for  18  different  display  devices  (window 
systems,  screen  editors,  typesetters,  printers,  software 
systems).  It  is  easy  for  a  user  with  a  different  software 
preference  or  hardware  availability  to  add  another  simi¬ 
lar  function. 

The visible effect of the print.display function is the appearance
of an ascii image of the object on a display
device. The mechanism by which this happens is important.
We construct an intermediate file, using the S
sink function, and forward that file to the print.ascii function,
the default print method for objects of class file.
We optionally return the name of the file in the "file"
attribute of the "result" attribute of the print.display function.
The print.ascii function prints the file on whatever
device has been defined. Display devices can be defined
with options(display="xedit") or by an explicit argument
to the print.display function.

Several of our functions create intermediate files in
other formats. For example, display.latex creates dvi files
(class=c("dvi","file")), which are in turn automatically
printed by print.dvi. We have therefore provided several
related functions to allow direct manipulation of dvi,
PostScript, and scanned image files. We give applications
using these display techniques and discuss the construction
of the sets of functions designed to work with them.

3.1 print.display as a Method for the print Function

The initial impetus was the Vars function (Harrell
1992a), which collected supplementary information
about data.frames (class, factor levels, formats, variable
labels) and displayed it in a window. The motivation for
the present paper was the recognition that Vars() was a
combination of two separable functions. First, it queried
a data.frame and constructed a summary of the supplementary
information. Second, it displayed its results in
a window on the display screen. In this paper, we focus
on the design of the display technology and use the
initial application as an example.

The user-level function print.display is the print method
for objects of class display. It acts like an ordinary
method, in that its behavior depends on the class
of its argument. It differs from an ordinary method
due to its dependence on the normally hidden generic
print.ascii function. print.ascii is aware of its environment,
more specifically of the value of several components
of the options() vector, principally of options()$display.
When options()$display is NULL the visible behavior of
print.display is identical to that of print: the output is
sent to the standard output connected to the S process.
When options()$display is non-NULL, the print.ascii function
sends the output file created by print.display to the
appropriate display device.

The display construct generalizes to include printers
and software interfaces. We provide definitions of
display="lpr", display="lp" for Unix printer spools. In Section
4.3 we describe the function display.latex to convert
an S data.frame to a LaTeX tabular environment. In Section
4.5 we describe the function display.sas to convert an
S data.frame to a SAS data file. In both examples, we
have options that allow the target programming system
to be executed under S control.

3.2  Family  of  display.*  Functions  and  the 
Programming  Environment 

The new generic function print.ascii queries the value
of options()$display to determine the specific display device
(say it finds "X") and then forwards the temporary
file constructed by print.display and any additional arguments
to the function display.X. The function display.X
uses the arguments and any additional options and then
constructs and executes a Unix command for the display.
End users will need to set options(display="X"),
but will otherwise not generally work directly with the
display.* functions.

The display.* functions are similar in behavior to the
generic functions and associated methods of S, in that
the user calls the generic print.ascii and lets it decide how
it should behave. The difference is that the print.ascii
function depends on environmental information, specifically,
the value of components of the options() vector, to
determine its behavior. The S generic functions depend
on only the class of their argument.
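
The two-stage mechanism described above (an intermediate file written by print.display, then a device chosen from the options) can be sketched outside S. The following is a minimal Python illustration, not the authors' S code: the `options` dictionary, the `DISPLAYS` registry, and the function names `print_display`, `print_ascii`, and `display_xedit` are all assumptions made for the sketch, and the device handler only builds a Unix command instead of executing it.

```python
import os
import tempfile

# Stand-in for the S options() vector; None means "no display device set".
options = {"display": None}


def display_xedit(path, x_flags=""):
    """Hypothetical display.* member: build a Unix command for the file.

    The command is returned rather than executed so the sketch runs anywhere.
    """
    return ["xedit", *x_flags.split(), path]


# Hypothetical registry playing the role of the display.* family.
DISPLAYS = {"xedit": display_xedit}


def print_ascii(path, display=None, **flags):
    """Choose the device from the argument or, failing that, from options."""
    device = display if display is not None else options["display"]
    if device is None:
        # No device defined: behave like ordinary print to standard output.
        with open(path) as handle:
            print(handle.read(), end="")
        return None
    return DISPLAYS[device](path, **flags)


def print_display(obj, display=None, **flags):
    """Write the text image of obj to an intermediate file and forward it.

    The temporary file is left in place, since a viewer would still need it.
    """
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w") as handle:
        handle.write(f"{obj}\n")
    return print_ascii(path, display=display, **flags)
```

Adding a new device in this sketch means adding one function and one registry entry, which mirrors the paper's remark that a user can add another similar display.* function.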

3.3  print.file  as  a  Method  for  the  print  Function 

The function print.file, a method of the generic print,
is itself a generic function with methods print.ascii,
print.latex, etc. The function print.ascii, the default
method for objects of class "file", depends on the environmental
information of the options()$display.

Objects in class="file" consist of vectors of file names.
A typical example is tmp2, the vector of names of files
resulting from the function call:

> tmp2 <- print.latex("z.tex", dvi.command="dvips", safe=F)
> # Side effect: the file "/tmp/z3906.ps" is printed using
> # the method defined in options()$ps.command

In this example, the print.latex function took the Unix
file name z.tex containing a LaTeX fragment, appended
the missing statements to create an expanded LaTeX
file, typeset the expanded file to create a dvi file, sent
the dvi file to dvips for conversion to PostScript, and
printed the PostScript file using the method defined in
options()$ps.command.

4.  Applications 

4.1 Information About a Data.Frame

The motivating application Vars is the display of summary
information about a data.frame:

> my.data.frame <-
+ data.frame(x=1:2, y=factor(c("a","b")),
+ q=structure(3:4, label="Z"))
> my.data.frame
  x y q
1 1 a 3
2 2 b 4
> Vars(my.data.frame)  # default: sort by variable name
  Label Class  Levels
q Z
x
y       factor a b

The function Vars returns an object of class "display",
therefore  the  result  of  the  Vars  function  goes  directly  to 
the  display  (in  this  example,  the  terminal). 

4.2  Contents  of  a  data.frame 

We often wish to view the contents of a data.frame while
constructing or interpreting an analysis. Say we have a
data.frame with 26 variables, and we are currently studying
a model based on columns 11:15. The statement

> options(display="jot")  # editor with SGI
> print.display(cars93[,c(11:15, 1:10, 16:26)],
+ X.flags="-font Courier10", width=280)

displays  the  reordered  columns.  In  addition,  the  width 
has  been  increased  so  each  row  of  the  data.frame  appears 
on  one  row  of  the  output  file.  The  small  font  allows  more 
columns on the screen simultaneously. The editor's scroll
bars  are  used  to  move  around  in  the  window. 

4.3  Display  S  Objects  in  Documents 

An S data.frame is often displayed in LaTeX table
and/or tabular environments. We provide the function
display.latex, a member of the display.* family, to perform
the conversion. At user level, the commands are:

> print.display(my.object,
+ display="latex", dvi.command="xdvi")

The function display.latex uses the generic function latex.
latex converts data.frame and matrix objects using the
specific function latex.default, an enhancement of the
latex.table package (Harrell 1992b, 1992c). Function objects
are converted by latex.function using either the standard
LaTeX verbatim environment or the S Example environment
(Chambers and Hastie 1993). Lists are converted
by latex.list, a recursive function that calls the
generic function for each element of the list.

The display.latex function is more complex than the
other members of the display.* family. The others, so
far, have been essentially alternate output destinations
for the mono-width ascii font produced by the generic
print function. The LaTeX program uses the complete
data.frame structure of the actual S object. It finds the S
object in the frame of the function that called display.latex
(by backing up through the calling sequence using the
sys.parent function) and sends that object to the generic
latex function. Users may wish to call latex directly.

The latex.default function uses format.df, a stand-alone
function based on Harrell's latex.table package. Numeric,
factor, and character data (including imbedded
blanks) are correctly formatted. Matrix components of
a data.frame are recognized. The function name format.df
indicates that the function is a model for a method designed
for data.frame objects.

The primary result of the display.latex function is a
LaTeX input file fragment.tex that will be pasted into a
complete document. Secondary results are the execution
of the latex program on the fragment, and display
of the result with the print.dvi method. print.dvi uses
another family of functions, with generic function dvi
and specific functions dvi.*, for the display of the file.dvi
files constructed by the latex function. Four examples
are provided: dvi.xdvi for X-windows, dvi.dvips for conversion
to PostScript, and dvi.lp and dvi.lpr for line printers.
When dvi.dvips is used, one of two additional arguments
(dvips.command or ps.command) may be used to
define a command (ghostview, lp, or lpr, for example) to
view the PostScript. When dvips.command is used, dvips
output is piped directly to the Unix command, usually a
printer spooler. When ps.command is used, dvips creates
a PostScript file and then calls print.ps to display the file,
usually on a screen viewer that might need to re-read
the input file to display an earlier page.

4.4  Display  of  Related  Files 

The next example uses the vector of file names of class
"file". We have a data.frame constructed by entering data
collected  by  means  of  a  multi-page  data  collection  form. 
The  form  is  stored  on  the  computer  system  in  a  set  of 
files,  one  per  page.  There  are  occasions  while  studying 
the  data  when  we  wish  to  see  the  image  of  the  paper 
data  collection  form. 

We construct an S variable form.page that records the
page  number  in  the  form  from  which  each  variable  was 
taken.  For  example, 

> my.data.frame <- data.frame(x=matrix(1:12,3))
> names(my.data.frame) <-
+ c("id","age","cholesterol","pulse")
> form.page <- paste("Study.R93.124",c(1,2,2,3),sep="/")
> names(form.page) <- names(my.data.frame)
> class(form.page) <- "file"
> print.default(form.page[c(1,4)])
                 id             pulse
"Study.R93.124/1" "Study.R93.124/3"
attr(, "class"):
[1] "file"

The structure of form.page says that id comes from page
1  of  the  form  and  pulse  comes  from  page  3. 

Now  when  we  wish  to  examine  a  particular  page  of  the 
form,  based  on  the  current  view  of  the  data,  we  can  just 
call  it  up.  We  assume  the  form  is  stored  in  ascii  text: 

> form.page["age"]  # print page with "age" question

Since form.page is a vector of class "file", the referenced
page  is  automatically  printed  by  the  print.file  method. 
This  system  generalizes  to  any  subclass  of  file. 

4.5  Interface  to  Programs,  SAS  as  an  Example 

The print.display function takes an S object and converts
it to another format. The examples so far have
been conversions to visual formats. Conversion to the
system  file  format  of  another  software  system  also  fits 
the  general  definition.  Indeed,  the  *.tex  file  format  is 
also  an  example  of  input  to  another  software  system. 

In this section we show the conversion to the format of
another statistical system. We use SAS (SAS Institute,
1990) as the example. The function display.sas is a member
of the display.* family of functions. It is more complex
than most members of the family as it must prepare not
only a data file but also a stdin file that gives instructions
to SAS to reproduce the variable names and the numeric
or character values of the variables. It uses the complete
data.frame structure of the actual S object by backing up
through the calling sequence (using the sys.parent function
in the same way as does display.latex). Users will
usually choose to call sas directly. Matrix components of
the S data.frame are separated into individual columns.
Character and factor data are identified as character in
the stdin file. Missing values in numeric data and imbedded
blanks in character data are converted correctly. sas
also uses the format.df function.

The simplest application is moving the data for analysis
or display using a technique that is available elsewhere.
An extension of the simple application is placing
the conversion inside an iteration loop in S and using
the S function to drive an iterative technique using procedures
available in the other program. A working example
of the iteration loop is included with the distribution.

4.6 Use of nroff/troff or LaTeX for S Help Files

The help files in S are written in nroff/troff. When the
nroff files are displayed in the S window they often run
off the screen and are not visible at the same time as
the command for which they provide guidance. We propose
three methods to print them in their own window
elsewhere on the display screen.

The first method is a simple revision of the help function
in S. The help.display function places the nroff output
on a temporary text file, then sends the name of the
temporary file to the display command.

The second method is more complex, but often needed
because the nroff program is an option, not automatically
distributed with Unix systems. The function help.tex converts
the nroff source files from the .Data/.Help directory
to LaTeX (using the doc_to-tex files from Chambers and
Hastie (1993)), runs the conversion through the LaTeX
program, and displays the dvi file on the display screen.

The  third  method  sends  the  troff  output  to  the  screen 
using  a  preview.troff  program  with  the  sequence 

IRIS 61% setenv S-LP preview.troff
IRIS 62% S
> help(function.name, offline=T)

Abstract

Most computer users are not aware that a numerical
value entered into a computer may be stored as a
different value from the one supplied, and that this
rounding error is committed without any obvious
warning. In general, a numerical program can provide
only an approximate solution, and the output of a
computation on one machine may differ from the
output on another. Numerical computations are
therefore often unreliable. A method called interval
arithmetic, however, can bound the maximum error of
a given computation. This paper addresses the use
of LISP programming techniques to develop reliable
components for interval arithmetic. The LISP
programs are machine independent and provide
self-validated computation.

1.  Introduction 

All computer systems are finite state
machines, so they cannot compute with arbitrary real
numbers; they can represent only a finite subset of the
rational numbers. Consequently, any numerical
computation on a computer system involves rounding
errors and propagated errors, and the solution is only
an approximation. A general question is, "What is the
size of the error in the result?" In recent years,
mathematicians have developed a technique for keeping
track of errors, called interval computation or interval
analysis. For each real number $x$ they consider an
interval $[a, b]$ such that $x \in [a, b]$, where $a$ and
$b$ are real numbers, and they treat the interval as a new
kind of number. In this treatment, each real number
induces two real numbers for computation. In
general, neither of these two real numbers can be
represented exactly in a computer system. The
main deficit is that the resulting interval is too wide.

To overcome this deficit, we need to create the
shortest possible intervals for all initial values of the
computation. In this paper we consider, for each real
number $x$, two computer floating-point numbers $a$ and
$b$ such that $x \in [a, b]$, where $b - a$ is the shortest
interval width the underlying hardware can provide, so
that the resulting interval is the smallest we can
get. High-level programming languages that are
commonly used for numerical computation, such as
FORTRAN, C, and even C++, are not able to
implement this kind of computation.
The author has found that LISP is a suitable
language for implementing this interval computation. We
have developed some fundamental functions using the
Common LISP programming language [6]. The
reasons for using Common LISP are as follows. The
programs we developed in LISP are machine
independent. LISP does not have a size limitation for
integers, so the language can simulate any floating-point
number format with an arbitrary number of bits in
the mantissa. Finally, even a complex and difficult
algorithm requires only basic programming skill
to implement.

A mathematical foundation that supports
interval computation is given in Section 2. The
interval arithmetic operations are defined in Section 3,
and their implementation is described in Section 4.
Sample results are given in Section 5, followed by
a conclusion.

2.  Mathematical  Foundation 

In 1966 the mathematician R. Moore [3] proved
an important theorem that supports interval
computation, and since then interval computation has
become a new and growing branch of applied
mathematics. We will call this theorem the
fundamental theorem of interval computation. It is
necessary to state the theorem here to support our
work.

Theorem. Let $f(x_1, x_2, \ldots, x_n)$ be a rational
function of $n$ variables. Consider any sequence of
arithmetic steps which serve to evaluate $f$ with given
arguments $x_1, x_2, \ldots, x_n$. Suppose we replace the
arguments $x_i$ by corresponding intervals $X_i$
($i = 1, 2, \ldots, n$) and replace the arithmetic steps in the
sequence used to evaluate $f$ by the corresponding
interval arithmetic steps. The result will be an
interval $f(X_1, X_2, \ldots, X_n)$. This interval contains
the value of $f(x_1, x_2, \ldots, x_n)$ for all $x_i \in X_i$
($i = 1, 2, \ldots, n$).

The  proof  of  this  theorem  was  given  by  Moore 
[3].  We  will  present  an  example  to  justify  the  result 
of  this  theorem. 

An  Example 

Consider the function of one variable

$f(x) = x(x^2 - 2)$.

Suppose that we evaluate this function with interval
argument $X = [-3, 3]$. We first compute

$X^2 = [0, 9]$,

and

$X^2 - 2 = [-2, 7]$,

then compute

$X(X^2 - 2) = [-21, 21]$.

The Theorem guarantees that

$-21 \le f(z) \le 21$ for all $z \in [-3, 3]$.

The actual range of $f(z)$ for $z \in X$ is $[-21, 21]$. In
this case we have obtained exact bounds on the range of
$f$ by an evaluation of $f$ with an interval argument $X$,
because $f$ attains its minimum and maximum on $X$ at
the endpoints $z = -3$ and $z = 3$.
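
The evaluation above can be replayed mechanically. The following Python sketch is an illustration under stated assumptions, not the authors' LISP code: intervals are plain `(lo, hi)` tuples, and the helper names `mul`, `sqr`, `sub_scalar`, and `f_interval` are invented for the example. It also shows why the square is computed as a dedicated operation: multiplying $X$ by itself with the general product rule would give $[-9, 9]$ rather than $[0, 9]$, and a wider final enclosure.

```python
def mul(a, b):
    """Interval product: min and max over the four endpoint products."""
    products = [a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1]]
    return (min(products), max(products))


def sqr(a):
    """Interval square; the result is never negative, unlike mul(a, a)."""
    lo, hi = a
    if lo >= 0:
        return (lo * lo, hi * hi)
    if hi <= 0:
        return (hi * hi, lo * lo)
    return (0, max(lo * lo, hi * hi))


def sub_scalar(a, c):
    """Subtract the real constant c from the interval a."""
    return (a[0] - c, a[1] - c)


def f(x):
    """Point evaluation of f(x) = x (x^2 - 2)."""
    return x * (x * x - 2)


def f_interval(x):
    """Interval evaluation of f, following the steps in the text."""
    return mul(x, sub_scalar(sqr(x), 2))


X = (-3, 3)
enclosure = f_interval(X)

# Every sampled point value must lie inside the interval result.
samples = [-3 + 6 * k / 1000 for k in range(1001)]
assert all(enclosure[0] <= f(s) <= enclosure[1] for s in samples)
```

Sampling does not prove the enclosure, which is what the theorem provides; it only checks that the sketch is consistent with it.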

3.  The  Interval  Arithmetic 

The basic argument for using interval arithmetic
instead of real numbers is that it provides an error bound
for a numerical computation. In numerical
analysis, interval arithmetic provides, at every
computation step, an interval that includes all the values
in the range of the given mathematical expression over
the designated domain. The error is controlled
within this interval. Some early work in this area is
given by Aberth [1], Alefeld and Herzberger [2],
Ratschek and Rokne [5], and Moore [3, 4].


Let $A = [a_1, b_1]$ and $B = [a_2, b_2]$ be
closed intervals of real numbers. Then the interval
arithmetic operations are generally defined as

(1) Addition

$A + B = [a_1, b_1] + [a_2, b_2] = [a_1 + a_2,\ b_1 + b_2]$

(2) Subtraction

$A - B = [a_1, b_1] - [a_2, b_2] = [a_1 - b_2,\ b_1 - a_2]$

(3) Multiplication

$A * B = [a_1, b_1] * [a_2, b_2] = [\min(a_1 a_2, a_1 b_2, b_1 a_2, b_1 b_2),\ \max(a_1 a_2, a_1 b_2, b_1 a_2, b_1 b_2)]$

(4) Division

$1/B = [1/b_2,\ 1/a_2]$ if 0 is not in $B$

$A/B = A * (1/B) = [\min(a_1/b_2, a_1/a_2, b_1/b_2, b_1/a_2),\ \max(a_1/b_2, a_1/a_2, b_1/b_2, b_1/a_2)]$

In  a  special  case,  if  a  real  number  z  is  a 
floating-point  number  with  respect  to  a  given  machine 
then  a  degenerate  interval  [z,  z]  is  used  for 
computation. 
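
The four definitions translate directly into code. The sketch below is a Python illustration, not the authors' Common LISP functions; the `Interval` class and the `interval` helper are names assumed for the example. Endpoints are exact rationals (`fractions.Fraction`), so no rounding enters the endpoint arithmetic, loosely mirroring the paper's reliance on unbounded integers in LISP.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class Interval:
    """Closed interval [lo, hi] with exact rational endpoints."""
    lo: Fraction
    hi: Fraction

    def __add__(self, other):
        # (1) [a1 + a2, b1 + b2]
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        # (2) [a1 - b2, b1 - a2]
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        # (3) min and max over the four endpoint products
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval(min(products), max(products))

    def reciprocal(self):
        # (4) 1/B = [1/b2, 1/a2], defined only when 0 is not in B
        if self.lo <= 0 <= self.hi:
            raise ZeroDivisionError("divisor interval contains 0")
        return Interval(1 / self.hi, 1 / self.lo)

    def __truediv__(self, other):
        # (4) A / B = A * (1/B)
        return self * other.reciprocal()


def interval(lo, hi):
    """Convenience constructor accepting ints, strings, or Fractions."""
    return Interval(Fraction(lo), Fraction(hi))
```

A degenerate interval for an exactly representable number, as in the special case above, is simply `interval(z, z)`.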

4.  The  Implementation 

There are two different ways to round a
real number to a finite number of binary digits: some
computer systems use downward rounding and others use
upward rounding. In our implementation, we assume that
the computer system uses downward rounding. For an
input real number, the system provides the lower bound
of the interval. The upper bound of the interval is
obtained by adding 1 to the last bit of the lower bound.
This method assures that the obtained interval is the
smallest interval for a given real number. The number
of digits in the binary representation is arbitrary and is
determined by the user at run time.
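
The construction of the enclosing interval can be sketched as follows. This is a Python illustration rather than the authors' LISP procedure, and it rests on two assumptions: that "number of digits" means the number of binary digits after the point (which matches the sample output in Section 5), and that an exactly representable input yields a degenerate interval. The function names `enclose` and `to_binary` are invented for the example.

```python
from fractions import Fraction
from math import floor


def enclose(x, bits):
    """Return (lo, hi), multiples of 2**-bits, with lo <= x <= hi.

    lo is x rounded downward; hi adds one unit in the last binary place,
    unless x is exactly representable, in which case lo == hi.
    """
    x = Fraction(x)  # exact, e.g. Fraction("3.56") == 89/25
    scale = 2 ** bits
    lo = Fraction(floor(x * scale), scale)
    hi = lo if lo == x else lo + Fraction(1, scale)
    return lo, hi


def to_binary(q, bits):
    """Fixed-point binary string of a non-negative grid value q."""
    scaled = q * 2 ** bits
    if scaled.denominator != 1 or scaled < 0:
        raise ValueError("q must be a non-negative multiple of 2**-bits")
    whole, frac = divmod(scaled.numerator, 2 ** bits)
    return f"{whole:b}." + format(frac, f"0{bits}b")


lo, hi = enclose("3.56", 60)
print(to_binary(lo, 60))
print(to_binary(hi, 60))
```

Passing the input as a string keeps it exact; passing a Python float such as `3.56` would enclose the already-rounded machine value instead of the decimal number the user typed.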

We have developed a menu-driven
procedure in LISP to handle interval arithmetic. The
procedure includes six functions: addition, subtraction,
multiplication, division, conversion from real to binary, and
conversion from binary to real. The precision is arbitrary
and is entered by the user at run time. This
can provide the most accurate computation for the
given underlying hardware. To execute each function,
the user inputs two real numbers, x and y, and the
number of desired binary digits for each. For each real
number, the procedure returns an interval, in binary
digits, that contains the real number. Next, the
procedure returns the answer of the operation in two
forms: a binary interval and a decimal interval. The
format is shown below:

Select  your  function: 

addition,  subtraction,  multiplication,  or  division 

Enter  a  number:  a  real  number 

Enter  the  number  of  digits:  an  integer 

The  lower  bound  in  binary  digits 

The  upper  bound  in  binary  digits 

Enter  a  number:  a  real  number 
Enter  the  number  of  digits:  an  integer 

The  lower  bound  in  binary  digits 
The  upper  bound  in  binary  digits 

Result  =  [binary  lower  bound,  binary  upper  bound] 

=  [decimal  lower  bound,  decimal  upper  bound] 

5.  Sample  Results 

We have carefully tested the procedure and
found that it performs the predefined functions
correctly. The LISP program was run on a MicroVax
II machine with the ULTRIX Version 2.0 operating
system. Some sample output is given below:

1. Addition

Enter a number: 3.56
Enter the number of digits: 60

The lower bound
= 11.100011110101110000101000111101011100001010001111010111000010
= 3.559999999999999999514277426726494013564661145210266113281250

The upper bound
= 11.100011110101110000101000111101011100001010001111010111000011
= 3.560000000000000000381639164714897560770623385906219482421875

Enter a number: 3.5
Enter the number of digits: 20

The binary number
= 11.10000000000000000000

The lower bound
= 11.10000000000000000000
= 3.50000000000000000000

The upper bound
= 11.10000000000000000000
= 3.50000000000000000000

The result
= [111.000011110101110000101000111101011100001010001111010111000010,
   111.000011110101110000101000111101011100001010001111010111000011]
= [7.059999999999999999514277426726494013564661145210266113281250,
   7.060000000000000000381639164714897560770623385906219482421875]

2. Subtraction

Enter a number: 3.56
Enter the number of digits: 60

The lower bound
= 11.100011110101110000101000111101011100001010001111010111000010
= 3.559999999999999999514277426726494013564661145210266113281250

The upper bound
= 11.100011110101110000101000111101011100001010001111010111000011
= 3.560000000000000000381639164714897560770623385906219482421875

Enter a number: 2.56
Enter the number of digits: 60

The lower bound
= 10.100011110101110000101000111101011100001010001111010111000010
= 2.559999999999999999514277426726494013564661145210266113281250

The upper bound
= 10.100011110101110000101000111101011100001010001111010111000011
= 2.560000000000000000381639164714897560770623385906219482421875

The result
= [0.111111111111111111111111111111111111111111111111111111111111,
   1.000000000000000000000000000000000000000000000000000000000001]
= [0.999999999999999999132638262011596452794037759304046630859375,
   1.000000000000000000867361737988403547205962240695953369140625]

3. Multiplication

Enter a number: 2.45
Enter the number of digits: 60

The lower bound
= 10.011100110011001100110011001100110011001100110011001100110011
= 2.449999999999999999826527652402319290558807551860809326171875

The upper bound
= 10.011100110011001100110011001100110011001100110011001100110100
= 2.450000000000000000693889390390722837764769792556762695312500

Enter a number: 2.63
Enter the number of digits: 40

The lower bound
= 10.1010000101000111101011100001010001111010
= 2.6299999999991996446624398231506347656250

The upper bound
= 10.1010000101000111101011100001010001111011
= 2.6300000000001091393642127513885498046875

The result
= [110.011100011000100100110111010010111100010001111111111111111111011110010101100000010000011000100100111000000000000000000000,
   110.011100011000100100110111010010111100011011110011001100110101010011011101001011110001101010011111110000000000000000000000]
= [6.44349999999803912896674529253729374947023279471680992059521020376422484332579188048839569091796875000000000000000000000,
   6.443500000000267393267250337629623815452620771557605414614142147478048627817770466208457946777343750000000000000000000000]

4. Division

Enter a number: 3.63
Enter the number of digits: 40

The lower bound
= 11.1010000101000111101011100001010001111010
= 3.6299999999991996446624398231506347656250

The upper bound
= 11.1010000101000111101011100001010001111011
= 3.6300000000001091393642127513885498046875

Enter a number: 2.66
Enter the number of digits: 40

The lower bound
= 10.1010100011110101110000101000111101011100
= 2.6599999999998544808477163314819335937500

The upper bound
= 10.1010100011110101110000101000111101011101
= 2.6600000000007639755494892597198486328125

The result
= [1.01011101010110100111011101010110100111010001001010100011111010011011011001110011,
   1.01011101010110100111011101010110100111011111011000110111011111001001011101111110]
= [1.36466165413464551749073433638998156694314978221171941186184994876384735107421875,
   1.36466165413545403149296036024006316704146324347846075397683307528495788574218750]

6. Conclusion

We have successfully developed a set of
functions for interval arithmetic using the
Common LISP language. These functions are
machine independent. The desired number of binary
digits is assigned by the user, and this number
determines the precision of the computation. The LISP
environment stores integers and symbols as linked
lists of memory cells, so integer arithmetic has no range
limitation; the only limit is the memory size.
These functions are therefore ready to use for any
application of interval computation.

0. Abstract

Historically, computer programs (and especially those for
statistical applications) used separate documentation. Online
documentation, especially "help" integrated and customized
to the program context, has become a common and useful
feature of contemporary programs.

This talk discusses some advantages of operating the
program FROM the documentation rather than the
documentation from the program. This approach allows users
to focus on the task that they want to do, rather than
learning the syntax or other details of a feature of the
program. That is, we can set aside the peculiarities of the
program design and concentrate on what has to be done.
Moreover, the documentation can encompass several
different pieces of software, allowing comparisons of
different programs or their cooperative use for problem
solving.

The Software Taxi is a simple hypertext documentation and
program launching system developed by the author with
Mary M. Nash and assisted by Mary Walker-Smith. The
discussion illustrates the idea outlined above and tests its
strengths and limitations.

1. The Problem

The growth in the number and size of software packages,
particularly those for scientific and statistical computation,
presents users and prospective users with a learning-cost
problem. That is, it is costly in time and effort, and often
in money for (unnecessary?) software acquisition, to

•  learn about software and its features and style

•  learn how to use (statistical) software

This issue of software acquisition and use is hardly new. It
is, however, exacerbated by the burgeoning size and
complexity of packages. Worse, there seem to be few
serious attempts to address the issue of software learning
costs. This paper considers the learning-cost problem and
suggests one way to address it that works with many but
unfortunately not all packages.

2. Approaches to Learning about Software

If  one  can  afford  the  fees  and  the  time,  training  courses 
are  a  very  good  way  to  learn  about  a  particular  software 
package.  They  tend  to  stress  the  "how  to"  aspects  of  using 
software  rather  than  its  features  and  capabilities.  We  must 
hope  that  the  topics  covered  are  those  of  interest  to  us. 

If  we  already  have  access  to  a  particular  software  package, 
then  we  can  read  the  manual.  We  should  read  it  again 
before  trying  the  program.  Again,  the  manuals  tend  to 
focus  on  "how  to".  If  we  are  in  a  hurry  to  do  some 
specific  calculation,  we  may  find  manuals  quite 
frustratingly  detailed.  They  may  presume  we  have  read  the 
material starting at the beginning. The trend to multi-volume
manuals renders this approach even less attractive.

"Quick  Start"  manuals  and  on-line  tutorials  could  be  used 
to  attempt  to  employ  the  program  quickly.  The  tools 
illustrated,  such  as  simple  descriptive  statistics  or 
regression,  may  not  be  the  ones  we  want  to  learn. 

Many  of  us  will  simply  "try  out"  a  program  and  look  for 
pointers  on  its  features  and  method  of  use.  That  is,  we 
hope that the affordances of the design (Norman, 1992),
that is, the layout, symbols, and other indicators in the user
interface, are sufficient to let us infer what can be done
and how to do it. This is difficult to arrange in a clear
way  for  mathematical  software  of  any  sophistication. 

While  there  has  been  much  investment  in  "intuitive"  user 
interfaces,  it  is  hardly  evident  that  such  developments  are 
helpful  to  users  trying  to  learn  the  features  and  use  of 
statistical  software. 

•  The  conventions  used  by  the  interface  designer 
may  be  unfamiliar  or  not  obvious  to  users. 

•  Such  conventions  may  not  be  easy  to  learn.  For 
example,  those  who  use  a  strong  wrist-hand 
motion  to  play  piano  may  move  the  whole  mouse 
when  trying  to  activate  a  button. 

•  The  typical  icons  and  simple  pull-down  menus 
may  be  practically  useless  for  even  moderately 
complicated  statistical  operations.  This  may  be 
the reason that windowed spreadsheet software
has  introduced  "analysis  dialogue  boxes"  for 
statistical  operations  to  allow  for  the  setting  and 
adjustment  of  options  and  control  parameters. 

3.  User  Needs  in  Software  Documentation 

The  learning  issue  is  intimately  connected  with  that  of  the 
design  of  user-documentation  for  software.  By  looking  at 
what  the  user  needs  in  documentation,  we  may  find  some 
pointers  to  helpful  user-interface  features. 

First,  user  documentation  needs  the  actual  documentation 
material  to  be  well-organized.  We  need  to  know 

•  what  can  be  done 

•  how  to  do  it,  and 

•  extra  features  available. 

Second, beyond the organization of material, users are helped by a
good table of contents to provide a map to the material.

Third,  and  especially  useful  to  users  with  some 
experience,  a  complete  index  is  needed.  This  should  have 

•  listings  of  alternate  terms  for  any  concepts  or 
topics  where  different  nomenclature  or  usage 
exists; 

•  appropriately  distinguished  meanings  of  terms 
where  confusion  may  arise.  For  example,  chi- 
squared  statistics  are  used  in  a  variety  of 
situations,  so  that  an  unmodified  entry  under  this 
topic  is  not  very  helpful. 

Fourth,  documentation  should  include  reasonable 
examples.  This  is  an  onerous  requirement,  since  the 
description  of  non-trivial  examples  that  do  not  overload 
the  user  with  a  plethora  of  details  requires  a  good  deal  of 
effort  and  care. 

Finally,  the  documentation  must,  as  concisely  as  possible, 
give a clear description of the software capabilities.

4.  Hypertext 

Hypertext  provides  a  good  way  to  structure  software 
documentation.  For  those  unfamiliar  with  hypertext,  the 
term  here  refers  to  a  mechanism  for  displaying  units 
(usually  screens)  of  material  in  which  are  embedded  menu 
choices  or  "buttons"  that  a  user  can  select  to  move  to 
another  unit  or  screen  of  material.  The  process  of  moving 
from  one  screen  to  another  is  called  navigating  the 
hypertext.  In  most  examples,  and  in  the  Software  Taxi 
mentioned  below,  the  "buttons"  can  also  start  programs. 
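The mechanism just described can be sketched in a few lines of Python. This is only an illustration of screens, buttons, and navigation as defined above; it is not the Software Taxi's actual file format or code, and the names `Screen`, `Button`, and `select`, as well as the example screens and the `echo` command, are hypothetical.

```python
import shlex
import subprocess
from dataclasses import dataclass, field


@dataclass
class Button:
    """A menu choice embedded in a screen (hypothetical structure)."""
    label: str
    target: str = ""    # name of the screen to move to, if any
    command: str = ""   # program to start, if any


@dataclass
class Screen:
    """One unit of hypertext material with its embedded buttons."""
    name: str
    text: str
    buttons: list = field(default_factory=list)


def select(screens, current, choice, run=subprocess.run):
    """Follow button number `choice` on screen `current`.

    Starts the button's program (if it has one) and returns the name
    of the screen to display next.
    """
    button = screens[current].buttons[choice]
    if button.command:
        run(shlex.split(button.command), check=False)
    return button.target or current


# Illustrative content only; the command is a stand-in for a real program.
screens = {
    "main": Screen("main", "Robust regression: overview",
                   [Button("Details", target="details"),
                    Button("Run example", command="echo running example")]),
    "details": Screen("details", "Control parameters and options",
                      [Button("Back", target="main")]),
}
```

Navigating is then a matter of repeatedly calling `select` with the user's choice. Passing the program launcher in as the `run` argument keeps the navigation logic separate from the operating system, which is one way to limit the portability problems discussed later.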


While  hypertext  methods  offer  the  potential  to  present 
material  to  users  in  a  convenient  and  efficient  way,  we  do 
need  to  ensure  a  very  good  organization  of  the  material  to 
be  presented.  Details  can,  and  should,  be  "hidden"  in 
screens  that  are  off  the  main  pathway  users  are  likely  to 
follow  when  navigating  a  documentation  hypertext. 

By  running  programs  under  the  hypertext  manager,  we  can 
display  graphs  and  tables,  or  play  sounds  or  voice 
messages.  Unfortunately,  the  file  formats  and  file  sizes 
remain  a  difficulty  to  the  portability  of  the  hypertexts. 

Similarly,  we  can  run  the  program  we  wish  to  document, 
and  this  is  one  of  the  main  messages  of  this  paper.  That 
is,  after  documenting  a  program  feature,  we  can  run  the 
program  to  illustrate  how  it  works.  This  gives  us  our 
"documentation  with  online  programs",  turning  around  the 
conventional  approach  of  online  "help"  within  a  program. 
We  do  not,  of  course,  need  to  eliminate  the  latter.  Note 
that  this  idea  offers  a  twist  on  the  usual  "write  the 
documentation,  then  the  program"  maxim. 

5.  The  Software  Taxi 

The  Software  Taxi  was  introduced  at  the  25th  Interface  in 
San  Diego  (Nash,  1994)  as  a  prototype  to  test  possibilities 
of  running  several  programs  to  attack  a  computational 
problem.  The  design,  partly  because  of  its  experimental 
nature,  was  of  necessity  kept  deliberately  simple  and 
small,  yet  had  to  allow  easy  "jumps"  between  different 
hypertext  files.  Indeed  its  development  was  a  result  of 
dissatisfaction  with  two  commercial  and  one  shareware 
systems  along  with  a  lack  of  documentation  of  such 
features  in  some  other  systems. 

The  1993  version  of  the  Software  Taxi  took  only  a  few 
days  of  effort  to  prepare.  The  latest  version,  while  adding 
little  superficially,  has  been  improved  to  more  easily  run 
other  programs,  has  added  utilities  to  verify  hypertext 
files,  allows  users  to  access  hypertexts  in  a  given  directory 
directly,  and  generally  is  better  set  up  for  use  by  both 
authors  and  general  users. 

We  still  regard  the  Software  Taxi  as  a  prototyping  tool  to 
structure  and  test  the  material  in  a  hypertext.  Since  it  is 
based  on  plain  text,  the  files  are  portable  but  are  not 
"fancy".  Should  the  Hypertext  Markup  Language  (HTML) 
stabilize,  we  would  consider  extending  the  Software  Taxi 
with  an  HTML  "front-end". 

At  the  time  of  writing,  the  Software  Taxi  is  being  prepared 
for  distribution.  The  Level  0  (user)  version  will  be 
freeware and can be obtained by electronic mail from the
author.  We  hope  to  install  it  on  various  bulletin  boards 
and  ftp  servers.  For  those  wishing  to  author  hypertexts, 
there  are  two  other  levels  of  the  software  incorporating 
various  aids  to  hypertext  preparation  and  support. 

6.  Advantages  of  a  Hypertext  Approach 

The main advantage of the hypertext approach is that we
can  tell  a  user  about  a  feature  of  some  software  then 
immediately  and  directly  illustrate  what  we  have  just 
discussed.  We  need  not  worry  that  documentation 
describes  a  feature  that  has  been  altered  or  replaced. 

More  importantly,  we  only  need  focus  on  elements  in  the 
software  that  are  of  current  interest.  A  large  statistical 
package  will  have  many  more  capabilities  than  a  single 
user  is  likely  to  be  concerned  with  on  a  single  occasion. 
Thus,  if  we  are  concerned  about  robust  regression,  for 
example,  the  documentation  can  discuss  the  merits  of 
different  approaches,  the  reasons  for  implementing  one  or 
another,  the  control  parameters  and  other  highly 
specialized  matters,  and  the  user  can  try  out  and  learn 
about  these  options  without  having  to  learn  a  great  deal  of 
the  rest  of  the  package.  Moreover,  the  hypertext  scripts 
provide  examples  of  how  to  control  the  program. 

Having achieved the title goal of "documentation with
online  programs",  we  note  that  the  programs  do  need 
scripts.  The  preparation  of  these  can  be  assisted  by 
programs  that  are  built  into  the  hypertext.  Such  program 
generators  are  a  form  of  tool  that  could  be  more 
generally  used  to  good  effect  in  scientific  computation. 

In  the  Software  Taxi  we  have  found  it  useful  to  capture 
the  sequence  of  screens  or  actions  chosen  by  a  user  and  to 
allow  automatic  playback  of  such  sequences.  This  is  a 
possible  approach  for  organizing  "work  in  progress",  in 
essence  attempting  to  automate  the  lab  notebook. 
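A minimal sketch of such capture and playback follows. It assumes that each step is either a screen shown or a program run and that the sequence is stored as a JSON file; the Software Taxi's own storage format is not described here, so the `Recorder` class, the `playback` function, and the step layout are hypothetical.

```python
import json


class Recorder:
    """Record the screens or actions a user chooses, in order."""

    def __init__(self):
        self.steps = []

    def record(self, kind, value):
        # kind is "screen" or "run"; value is a screen name or a command
        self.steps.append({"kind": kind, "value": value})

    def save(self, path):
        with open(path, "w", encoding="utf-8") as fh:
            json.dump(self.steps, fh, indent=2)


def playback(path, show, run):
    """Replay a saved sequence by calling show() or run() for each step."""
    with open(path, encoding="utf-8") as fh:
        steps = json.load(fh)
    for step in steps:
        handler = show if step["kind"] == "screen" else run
        handler(step["value"])
    return len(steps)
```

Because the saved file is plain text, it can be read, edited, and annotated by hand, which fits the "lab notebook" use suggested above.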

7.  Disadvantages  of  the  Hypertext  Approach 

In  trying  to  reduce  learning  costs,  we  can  remove  the 
bulky manual, but we still require the user to read at least
a  few  of  the  documentation  screens.  (You  still  have  to 
watch  the  movie,  even  if  you  don’t  wish  to  read  the 
novel.)  Moreover,  the  hypertexts  must  be  prepared.  Even 
though  the  Software  Taxi  is  designed  to  make  this  as 
simple  as  possible,  it  is  still  a  chore. 

More  seriously,  a  lot  of  software  cannot  be  run  under 
control of a script. This is particularly true of such popular
tools as spreadsheets. It also applies to almost all software
set  up  for  "windowed"  operating  environments.  While 
there  are  some  intrinsic  obstacles  to  controlling  certain 
graphic  operations  by  scripts  (P  Velleman,  in  Goldstein, 
1993),  it  should  be  relatively  simple  to  provide  scripts  at 
the  level  of  "what"  to  do.  However,  after  nearly  a  decade, 
the  Apple  Macintosh  operating  environment  is  just  now 
getting  command  script  capability.  (There  have  been  some 
third-party  offerings.) 

A  final  caution  with  hypertexts  is  that  changes  in  system 
configuration  (or  movement  of  hypertexts  to  different 
systems)  can  cause  unpredictable  results.  This  is,  of 
course,  a  continuing  issue  for  any  program  that  behaves  as 
an  operating  shell. 

8.  Trends 

It  seems  obvious  that  personal  computing  equipment  such 
as  the  Apple  Newton  and  similar  book-sized  devices  are 
likely  to  proliferate  and  become  the  principal  computing 
interface  for  many  users.  Such  devices  use  "pen-based" 
operating  environments  and  are  well-suited  to  the 
hypertext  /  action  approach  to  documentation  and  use  of 
programs.  Moreover,  as  users  need  to  run  more 
complicated  software,  graphic  icons  are  more  likely  to  be 
confusing rather than helpful, especially with a small
screen in uncertain lighting. Plain text allows greater detail
to  be  presented  and  may  be  more  helpful  to  users. 

9.  Summary 

In  many  situations  we  can  document  program  features  and 
processes  then  illustrate  them  "on-line".  Moreover,  this 
approach  can  be  simple  and  effective.  However,  the  need 
for  scripting  remains  an  obstacle  to  easy  porting  of  the 
approach  to  windowed  operating  environments. 
Abstract

Last year the authors presented a preliminary report
on the advantages and disadvantages of employing a
widely-used spreadsheet package in an introductory applied
statistics course. In that investigation there was a detailed
comparison of the latest versions of Minitab, Microsoft
Excel, and Lotus 1-2-3 for classroom use, in which the
authors recommended the well-known statistical package
over the two popular spreadsheet packages. Since that report,
both authors have taught courses with Minitab for
Windows and Excel. In addition, both Minitab and Excel
have released new versions. In this paper the authors will
present an updated recommendation on what is the most
appropriate software to use in today's applied statistics
courses. This recommendation will be based upon their
classroom experiences with Windows-based software and a
complete evaluation of the features introduced in the new
releases of Minitab and Excel.

Introduction 

Babson College is a small private college located in a
suburb of Boston. It only offers undergraduate degrees in
business and MBAs. All of its students are required to take
an applied statistics course or its equivalent. For over 20
years statistical software has been a vital component of
these courses. In recent years the platform for such software
at Babson has been the school's VAX computer. But
with students already exposed to, and many businesses
moving to, a Windows environment, it was determined in
the spring of 1993 that this should be the future environment
for Babson's statistical software. At that time we began
an extensive search for such software. Among the serious
possibilities for our 1993-1994 academic year were
using statistical software on the VAX for another year, using
a popular Windows-based spreadsheet package with
statistical capabilities, and using newly developed
Windows-based statistical software packages.

At last year's Interface we presented a comparison of
Minitab 9.0 for the VAX and two popular spreadsheets,
Lotus 1-2-3 and Excel 4.0. Minitab is a very popular package
for introductory statistics courses and has been used at
Babson since the early 1980s. Among the reasons for considering
a change to a spreadsheet package were low incremental
cost ($0), a familiar and user-friendly interface (all
business students use spreadsheets), and expanded statistical
capabilities. After discovering that Lotus, at that time,
had only descriptive statistics and regression available, we
focused our analysis on Excel 4.0. Initially we were quite
impressed by the statistical tools available in Excel 4.0.
But soon we became dismayed by some serious problems
with Excel 4.0. Thus at Interface '93 we gave a grade of C
or D to Excel 4.0, but with the potential of an A grade in
the future, and we recommended that users not move to a
spreadsheet package at that time.

Shortly after last year's Interface, Minitab announced
its first Windows product. After further study Babson decided
to use both this product and Minitab on the VAX
during the 1993-1994 academic year. We also decided to
reconsider our decision in 1994. In this paper we will discuss
our latest recommendation for the most appropriate
software for our statistics courses.

Windows-Based  Software  for  Statistics 

For our 1994 search we did not consider any VAX
software due to Babson's migration to Windows. Based
upon the school's decision to use Excel and other Microsoft
application software throughout the campus, we only examined
Excel 5.0. This decision was made even though the
latest releases of Lotus 1-2-3 and Quattro Pro have many
statistical functions.

We also decided only to examine Minitab for Windows,
even though most of the major statistical packages
are now released on Windows. Here, with the assistance of
a communication from Robin Lock, are some of this software,
where an asterisk (*) indicates the presence of a student
edition: BMDP*, Minitab*, SAS, S-Plus, SPSS*,
Stata, Statgraphics, Statistica, and Systat*/Mystat*. We
made this decision due to our favorable reaction to Minitab
Release 9.0 for Windows and the limited amount of time
available to us for a thorough evaluation.

Below is a discussion of the similarities and differences
between Excel 4.0 and Excel 5.0, and between Minitab 9.0
for the VAX and Minitab 9.0 and 10.0 for Windows. Then
there is a comparison of Excel, as a representative of the
Windows-based spreadsheets with statistical capabilities,
and Minitab, as a representative of the Windows-based
statistical packages. This is followed by our thoughts about
the future.
Excel

Excel is a Windows-based spreadsheet application developed
and sold by Microsoft Corporation. Excel is intended
to be user friendly with a graphical user interface.
Thus it makes extensive use of graphics, the mouse, tool
bars, and drop-down menus. It is currently being sold as
part of the Microsoft Office Suite and shares Drawing,
Equation, WordArt, and other objects with the other applications
in the suite. Excel has an installed base running in
the millions.

Although the "official list price" of Excel has been
around $400, in reality the street price of an Excel upgrade
is around $100. The price of an Office upgrade which includes
Word 6.0, PowerPoint 4.0, Access 2.0, and Excel 5.0
is under $300. For students, the price of the Office Suite is
under $200. Quantity pricing discounts are available for
large purchases.

The  December  1993  EXCEL.EXE  (version  5.0)  file 
has  a  length  of  4,185,600  bytes  and  the  complete  Excel 
Application  Package  occupies  16,367,256  bytes.  Of  course, 
there is some ambiguity associated with these numbers because
there are a host of Microsoft Applications such as
Drawing,  Equation,  and  WordArt  which  work  with  Excel 
and  are  not  counted  in  the  above  numbers. 

Due  to  the  fact  that  Excel  is  a  spreadsheet  package 
and  not  a  statistical  package,  it  is  often  difficult  to  find  its 
statistical  features.  Furthermore,  there  are  only  a  limited 
number  of  statistical  capabilities  in  Excel.  For  example,  it 
does  not  handle  nonparametric  analyses.  Excel  also  has 
problems  analyzing  a  large  data  set. 

There  is  only  limited  external  documentation  on  the 
statistical  capabilities  of  Excel.  For  example,  it  is  difficult 
to  determine  the  computational  algorithms  used  in  Excel, 
although the user guide does suggest that the most appropriate
algorithms are not being used. (A discussion on the
computational weakness of spreadsheets recently appeared
on the Internet Edstat Discussion List.) In addition, Microsoft
provides limited statistical support to its users.

The  statistical  features  of  Excel  are  organized  into 
Functions and Data Analysis Tools. The functions are
characterized  by  requiring  typically  one,  two,  or  three  input 
parameters.  These  parameters  are  usually  numbers,  strings, 
or  ranges.  The  functions  return  anything  from  a  single 
number  or  string  to  a  complex  data  structure  such  as  a 
frequency  distribution  table  in  a  vertical  array. 

Examples of functions, along with their Excel descriptions,
include

CHIINV(probability, degrees of freedom)
    returns the inverse of the chi-squared distribution

COVAR(array1, array2)
    returns covariance, the average of the products of
    paired deviations

FORECAST(x, known y's, known x's)
    returns a value along a linear trend

NORMDIST(x, mean, standard deviation, cumulative)
    returns normal cumulative distribution
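To make two of these descriptions concrete, here is a minimal Python sketch of what COVAR and FORECAST compute according to the descriptions quoted above. It is not Excel's implementation, whose algorithms are not documented here. Reading "linear trend" as an ordinary least-squares line is an assumption, and the function names `covar` and `forecast` are chosen only to mirror the spreadsheet names.

```python
def covar(xs, ys):
    """Average of the products of paired deviations from the means."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("need two non-empty samples of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Divides by n (an average), following the description quoted above.
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


def forecast(x, known_ys, known_xs):
    """Value at x on the least-squares line through the known points.

    Assumption: "linear trend" means the ordinary least-squares fit.
    """
    n = len(known_xs)
    mean_x = sum(known_xs) / n
    mean_y = sum(known_ys) / n
    sxx = sum((v - mean_x) ** 2 for v in known_xs)
    if sxx == 0:
        raise ValueError("known x values are all equal")
    slope = covar(known_xs, known_ys) * n / sxx
    return mean_y + slope * (x - mean_x)
```

Note the argument order of `forecast`, which follows the spreadsheet signature: the y values come before the x values, an easy source of error for users.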

Functions are invoked by using the Insert Function or
the Function Wizard Button. To understand how these
functions are used, consider NORMDIST. The meanings of
x, mean, and standard deviation are relatively obvious.
What is not clear is that cumulative is a logical variable: a
value of True or 1 causes NORMDIST to return the
cumulative value of the normal distribution while a value of
False or 0 causes NORMDIST to return the value of the
normal density function. To help the user with the choice
of inputs, Excel 5.0 has a Function Wizard with labeled
boxes for the inputs and a display box for the output. Thus
the user can vary the inputs and observe the output before
telling the wizard to put the result in the spreadsheet. This
feature is new in Excel 5.0 and is an improvement over
Excel 4.0 but still needs additional development. The
Function Wizard for NORMDIST does not make clear the
meaning or possible values of cumulative. Although
additional information can be determined by using the
on-line help, this causes a time delay to wait for the help
screen to appear. A major improvement would be to
display a complete set of information in the Function
Wizard Window.
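The role of the cumulative argument is easy to see in a small Python sketch of the two quantities involved. This is an illustration of the standard normal density and distribution function, not Excel's code; the name `normdist` and its parameter names simply mirror the spreadsheet function.

```python
import math


def normdist(x, mean, standard_dev, cumulative):
    """Normal density (cumulative false) or cumulative probability (true)."""
    if standard_dev <= 0:
        raise ValueError("standard_dev must be positive")
    z = (x - mean) / standard_dev
    if cumulative:
        # P(X <= x), written with the error function
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    # height of the density curve at x
    return math.exp(-0.5 * z * z) / (standard_dev * math.sqrt(2.0 * math.pi))
```

With mean 10 and standard deviation 2, `normdist(10, 10, 2, True)` gives the probability 0.5, while `normdist(10, 10, 2, False)` gives the density height of about 0.199. The two results answer different questions, which is why an unlabeled logical argument is easy to misuse.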

Many  of  the  functions  are  add-ins  which  must  be 
brought  into  Excel  before  they  can  be  used.  Unfortunately 
before  an  add-in  can  be  part  of  a  menu  it  must  be  loaded 
even  if  it  is  not  used  during  a  given  session.  A  better 
choice  would  be  to  include  all  add-ins  on  the  menu  and 
load  them  only  as  needed. 

The data analysis tools are invoked by an entirely
different mechanism than the functions. To activate a data
analysis tool, choose Tools from the Command Menu. This
is then followed by choosing the Data Analysis subcommand.
Since the Data command is next to the Tools command,
it is easy to confuse the sequence Tools Data Analysis
with Data Analysis Tools. The latter sequence does not
exist.

After a delay, the InputBox Window appears. This
Window features a number of labeled spaces for input.
This window looks very similar to the Function Wizard
InputBox Window. It is very easy to forget whether to invoke
a function or to invoke a tool in order to accomplish a
particular statistical task.

A  more  major  problem  occurs  when  you  actually  use 
one  of  the  tools.  Consider  the  use  of  the  HISTOGRAM 
tool.  While  using  the  HISTOGRAM  tool,  you  discover  that 


382  Software  for  a  Statistics  Course 


a  range  containing  input  categories  is  required  in  order  to 
construct  a  histogram.  Unfortunately  in  order  to  construct 
such  a  range  you  must  cancel  HISTOGRAM.  Then  you 
must  construct  the  input  categories  without  help.  If  needed 
you  can  use  the  on-line  help,  but  you  must  remember  your 
exact  help  topic.  Now  you  must  invoke  HISTOGRAM 
again.  Obviously  if  you  need  to  do  something  in  order  to 
execute  a  command,  you  should  be  able  to  do  so  by  pausing 
in  the  middle  of  the  sequence  to  do  whatever  is  necessary  to 
resume  the  command  sequence.  It  is  hoped  that  this  feature 
appears  in  the  next  version  of  Excel. 

We found a number of calculations which were
statistically incorrect. Among these bugs were the naming
of a unique mode in bimodal situations, the result of 0 for
the maximum when only missing values were present, and
the lack of tied rank values. Other serious computational
problems involved p-values, output when alpha was
specified as zero or one, and regression output from
collinear data.
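Two of these points can be illustrated with a short Python sketch of the behavior one would expect instead: reporting every mode of a bimodal sample, and giving tied observations the average of the ranks they share. The helper `average_ranks` is hypothetical and written only for this illustration; `multimode` is from the Python standard library (version 3.8 or later).

```python
from statistics import multimode


def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # extend the run while the next sorted value ties with this one
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        shared = (start + end) / 2 + 1   # mean of ranks start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks
```

For the sample 1, 1, 2, 3, 3, `multimode` reports both 1 and 3 rather than a single "unique" mode, and for 10, 20, 20, 30 the two tied values each receive rank 2.5.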

The vocabulary used for describing statistical
calculations often represents a poor choice of terminology
and surprisingly sometimes is totally inappropriate. Some
examples of such errors include the incorrect designation of
one- and two-sided p-values, and the specification of a
p-value as a test statistic. Another serious terminology gaffe
is stating the equivalence of alphas and confidence intervals
in performing confidence tests. Examples of additional
problems with Excel are available upon request from the
authors.


Excel  4.0  versus  Excel  5.0 

The philosophy behind Excel is to create a basic
spreadsheet engine and then enhance it through a variety of
specific add-ins. The good news is that this makes it easy
to enhance the spreadsheet through a variety of user or
commercial macros. This has become especially true since
Microsoft switched from the Excel 4.0 macro language to
Visual Basic for Applications. This is a user-friendly
macro language which makes it easy to write user-friendly,
visually-attractive, Windows-based applications. To promote
this development Microsoft sells an Excel Developer's
Kit, Version 5 for under $50.

The  bad  news  is  that  this  strategy  makes  macros  much 
slower than if they were compiled to native code. For example,
Excel requires 90 to 110 seconds to generate 1000
random numbers with a mean of 10 and a standard deviation
of 2. (Not all Excel tasks are slower. To produce a
histogram  of  the  above  numbers  in  Excel  requires  less  than 
30  seconds.) 

In addition to the introduction of Visual Basic, there
were many improvements to the user friendliness of Excel
with the introduction of version 5.0. Among these were the
introduction of sheets, pivot tables, drop-down identification
labels for tool buttons, and tips. There was also more
extensive use of tool bars, improvements to Wizards, and
more convenient zoom capability. In contrast, as researched
by Derek Upson, there were few new statistical capabilities
introduced in Excel 5.0. Two exceptions of note
were the capability to link graphs and spreadsheets in real
time and the capability to specify boundaries for a probability
calculation in one of the functions.

When we first examined the statistical features in Excel
4.0 we found problems with the vocabulary used for describing
statistical calculations and the accuracy of the calculations.
Very few of these problems were fixed in version
5.0. One such correction was the proper computation of the
p-value mentioned above. Another correction dealt with
the collinear regression output, but in this case another
problem was introduced. 0/0 does not equal 65535.


Excel as a Statistical Package

Advantages                            Disadvantages

Cost                                  Not a Statistical Package
Large Installed Base                  Limited Statistical Support
Known Interface                       Slow (Add-In Packages)
Extensive On-Line Help                Inconsistent Design
Visual Basic Macro Language           Lack of Capabilities
Up-To-Date Features Such as Wizards   Computational Concerns
                                      Poor Choice of Terminology
                                      Bugs


Minitab 

Minitab  is  a  popular  statistical  package  available  on  a 
large  number  of  platforms.  It  is  developed  and  sold  by 
Minitab, Inc. Its mainframe, microcomputer, and PC versions
employ an easy-to-use session command interface,
while its Macintosh and Windows versions employ graphical
user interfaces.

More  students  have  been  introduced  to  statistical 
software by the use of Minitab than any other piece of software.
Examples of its output are contained in a large number
of textbooks from a wide range of disciplines. It is also
used  by  analysts  in  many  companies  and  government 
agencies. 

The academic price for a single copy of Minitab for
Windows is under $500. Students may purchase the full
package  for  under  $200.  In  addition,  a  student  edition  may 
be  purchased  for  about  $50.  Quantity  purchase  prices  are 
also  available  for  academic  institutions. 

The  MINITAB.EXE  April  1993  file  has  a  length  of 
4,227,072  bytes  while  the  Minitab  application  files  total 
9,869,097  bytes  excluding  the  data  sets. 
Minitab contains most of the capabilities needed for
standard statistical analyses. In addition, it features a large
number of quality tools. There appear to be few difficulties
with running large data sets in Minitab.

Minitab provides a variety of well prepared documentation.
As with most of the statistical software mentioned
above,  Minitab  uses  respected  statistical  algorithms. 

For the most part due to its sole mission of providing
statistical tools, Minitab provides a user-friendly environment.
Still there are some exceptions: users must type in
functions when forming expressions and must leave dialog
boxes in order to request help. Due to 20 years of providing
statistical software, Minitab presents few problems with
terminology and is relatively bug-free.

Minitab  9.0  for  the  VAX  versus 
Minitab  9.0  and  10.0  for  Windows 

Minitab  9.0  for  the  VAX  is  a  powerful  member  of  the 
Minitab command-driven family. It contains a comprehensive
set of statistical capabilities. Among the newer
commands  in  this  release  are  a  factor  analysis  command,  a 
multivariate  analysis  of  variance  (MANOVA)  command, 
and  many  new  commands  for  quality  control  and  design  of 
experiments.  With  the  proper  hardware,  it  can  produce  a 
variety  of  high-resolution  graphs.  This  release  of  Minitab 
contains  a  powerful  new  macro  capability. 

Minitab  9.0  for  Windows  contains  all  the  capabilities 
of  the  VAX  version  tyou  do  not  need  additional  hardware 
to  produce  the  high-resolution  graphs).  The  major  differ¬ 
ence  between  the  two  versions  is  the  presence  of  the  Win¬ 
dows  graphical  user  interface.  Hence  Minitab  9.0  for  Win¬ 
dows  uses  a  mouse  and  keyboard  to  enter  commands 
through  drop-down  menus,  dialog  boxes,  and  even  session 
entries. There are five windows (Data, Session, Info, History,
and Graph) in this program. It does not use toolbars
or  smart  keys. 

Minitab  10.0  for  Windows  is  the  newest  Minitab 
product.  It  provides  additional  built-in  help  along  with 
more  powerful  data  management  capabilities.  Among 
these  are  a  direct  interface  with  Excel  for  the  transfer  of 
data and linking data using Dynamic Data Exchange
(DDE).  New  statistical  commands  in  this  release  are  ones 
for  cluster  analysis,  classical  time  series  analysis,  and  the 
design  of  experiments.  There  are  also  many  new  plots 
along  with  the  capability  to  edit  and  brush  graphs. 

Excel  versus  Minitab 

We compared the statistical capabilities of Minitab
and Excel in ten different areas: descriptive statistics,
inference on means, inference on proportions, ANOVA,
regression, contingency tables, nonparametric statistics,
time series analysis, quality, and probability. In seven cases we
concluded that Minitab was clearly superior to Excel. In
the case of inference on proportions they were equivalent
because neither package performed that analysis. In two
other cases, descriptive statistics and probability, we rated
the packages as equals.

For  example,  let  us  consider  the  descriptive  displays 
(graphs  and  tables)  available  in  the  two  packages.  A  com¬ 
parison  of  the  graphical  capabilities  of  Minitab  for  Win¬ 
dows  with  those  of  Excel  shows  that  the  two  packages  are 
roughly  equal.  Perhaps  a  slight  advantage  goes  to  Minitab 
in  terms  of  the  diversity  of  graphs  produced.  A  definite  ad¬ 
vantage  goes  to  Excel  in  terms  of  editing  and  manipulating 
the graphs which the package produces. We found it
extremely awkward and difficult to edit the Minitab graphs in
release 9.0, while the Excel editing process soon became
almost trivial through the use of the mouse. Minitab 10.0
has greatly improved the ease of editing graphs and is more
similar to Excel 5.0.

Minitab  does  not  do  three-dimensional  scatter,  radar, 
or  donut  graphs.  Excel  does  not  do  three-dimensional  scat¬ 
ter,  control  charts,  cause  and  effect,  dot  charts,  unnotched 
box,  or  notched  box  graphs.  Both  packages  did  standard 
bar,  grouped  bar,  stacked  bar,  histogram,  two-dimensional 
scatter,  Pareto  diagrams,  polygons,  high  low  close,  projec¬ 
tion,  and  contour/surface  graphs. 

Both packages were similar in their ability to generate
a variety of tables including cross tabulation, summary,
cumulative distribution, frequency distribution, and
percentage distribution tables. The pivot table concept in
Excel 5.0 makes the manipulation of tables easy with the use
of the mouse.

For some tasks, Excel is far slower than Minitab. For
example, the random number calculation presented above
requires less than five seconds in Minitab 9.0 for Windows.

Excel has the advantage of a large installed base
running into the millions. As a result, many users are already
familiar with the user interface. There is extensive on-line
documentation, but accessing it can be slow on older
machines. The addition of Visual Basic makes it much easier
to write user-friendly macros. The cost of the statistical
features of Excel is almost zero. It is not necessary to buy
and support a separate statistical package.

Still, Excel is not a true statistical package, even
though it provides many essential building blocks.
Microsoft provides only limited statistical support. There are
problems with trying to analyze large-scale problems.
Speed is also a problem for Excel. The access to statistical
features is somewhat inconsistent. Its statistical
documentation is limited. Some of its terminology is poorly chosen.
There are concerns about some of its algorithms. Finally,
there are far too many computational errors.
Thus we gave a grade of C to Excel 5.0, again with
the potential of an A grade in the future. In addition, we
recommended the continued use of Minitab 10.0 for
Windows for the 1994-1995 academic year.

The  Future 

In 1993 we were shocked by what we discovered about
the statistical capabilities of Excel. How could Microsoft
bring a product with so many weaknesses to market? Then
we realized that this may be part of Microsoft's corporate
strategy. It took the software leader many years to perfect
Windows. The first release of Access, one of its database
management programs, also contained many bugs.
Perhaps Microsoft is using all of its users as part of a gigantic
beta test.

Still, we were surprised that more of the Excel 4.0
problems were not corrected in Excel 5.0. Going into our
1994 comparison we expected more from Microsoft.
Hopefully future releases of Excel will contain fewer bugs, be
better documented, and provide easier access to more
statistical analyses. Without a doubt more people will analyze
data using Excel and other spreadsheets in the future due to
their immense base of users and their companies'
aggressive pricing policies. In addition, there soon will be well
designed add-ins by commercial vendors to enhance the
statistical capabilities of spreadsheets such as Excel.

Statistical software vendors should be aware of these
strong competitors for their market. In an environment in
which only change is constant, they must continue to
introduce easier-to-use products with increased capabilities at a
low price more frequently. They should also be aware that
many purchasers are sadly more concerned with the cost of
a product than the accuracy of its algorithms. Hence these
vendors must actively consider loss leaders such as student
editions in order to maintain, or hopefully increase, the
number of users who buy their products. Otherwise, they
may find themselves with far fewer customers.

Finally,  what  does  the  future  hold  for  us  users?  Still 
more  change.  In  this  market  we  believe  that  there  will  be 
many  new  interesting  products  for  us  to  consider  in  the  fu¬ 
ture.  At  Babson  we  are  already  preparing  for  next  year's 
evaluation. 
Using Multiple Processors to Compute Robust Regression Estimators

Abstract

Robust  regression  estimators  are  notoriously  hard  to 
compute.  Often  algorithms  require  that  fairly  simple 
computations  be  done  on  many  subsets  of  the  data. 
Parallel  processing  machines  would  be  ideal  for  such 
computation  but  they  are  often  not  readily  available 
to  researchers,  and  even  if  available,  they  often  re¬ 
quire  extensive  modification  to  the  code.  A  far  sim¬ 
pler  approach  is  to  distribute  the  computation  across 
several  processors  on  a  network.  The  code  is  modi¬ 
fied  to  do  the  computations  on  a  specified  portion  of 
the  subsets,  then  the  problem  is  split  into  pieces  and 
each  available  processor  on  the  network  is  used  to  do 
a  portion  of  the  computation.  The  results  from  each 
processor  are  then  collected  and  the  final  answer  is 
computed. 

1  Introduction 

This paper presents an improved version of the code
used for distributing the computation of the exact
least median of squares in multiple linear regression.
The new version (available from the first author by
e-mail to astroll@ukcc.uky.edu) is shown to be faster
than the code discussed in Hawkins, Simonoff, and
Stromberg (1994).

2  Distributed  versus  Parallel  Comput¬ 
ing 

The distinction between parallel and distributed
computing is often nebulous but still extremely
important. In parallel computing, multiple processors share
memory and exchange information while performing
a computational task. In an ideally parallelized
computation using k processors, the computation would
be completed in one k-th of the time or perhaps even less
time. In a distributed computation using k processors,
each processor works on a portion of the total
computation but the processors do not share memory
or other information. The partial solutions from
each  processor  are  collected  and  the  final  solution  is 
reported.  Because  the  processors  do  not  share  infor¬ 
mation,  distributed  computations  are  likely  to  require 
more  computing  time.  This  is  likely  to  be  the  rea¬ 
son  they  have  received  far  less  attention  in  the  litera¬ 
ture.  The  major  disadvantages  of  parallel  processing 
are  that  expensive  parallel  processing  machines  are 
required  and  that  software  code  is  usually  machine 
specific  so  the  user  must  learn  the  language  for  the 
available  machine  and  then  suffer  with  the  fact  that 
the  code  is  not  likely  to  be  portable  to  other  paral¬ 
lel  processing  machines.  Distributed  processes  do  not 
have  these  disadvantages.  They  can  run  on  an  exist¬ 
ing  network  of  CPUs,  and  the  code  transfers  with  at 
most  minor  modifications  (frequently  none!).  We  be¬ 
lieve  that  these  advantages  more  than  make  up  for  the 
fact  that  distributed  processes  may  be  slightly  slower 
than  idealized  parallel  processing. 

3  Steps  Required  to  Distribute  a 
Computation 

Hawkins,  Simonoff  and  Stromberg  (1994)  discuss  the 
steps  required  to  distribute  a  serial  computation 
across several processors. These steps are:

1.  Modify  the  serial  code  so  that  it  can  compute  any 
portion of the total computation.

2.  Identify  which  processors  are  available  to  assist 
in  the  computation. 

3.  Generate  input  files  identifying  the  part  of  the 
computation  to  be  done  by  each  processor. 

4.  Construct  a  file  that  sends  the  input  files  and  the 
code  from  (1)  to  each  processor. 

5.  Execute  the  file  in  (4). 

6.  Collect  the  output  from  each  processor  and  re¬ 
port  the  final  solution. 

As an example, they compute the exact value
of the least median of squares (Rousseeuw, 1984;
Stromberg, 1993) in multiple linear regression. Using
these steps and the code referenced in Hawkins et al.,
they provide examples showing the effectiveness of
this type of distribution. In this paper we will discuss
the distribution of the computation of the exact value
of the LMS estimate for the data set “educat.dat”
found in Rousseeuw and Leroy (1986). Hawkins et
al. report that the median computation time for five
runs on one SPARC-IPC was 9705 wall clock seconds
(162 minutes). Using four SPARC-IPCs, the median
computation time was 2618 seconds (44 minutes).
The distributed efficiency is then 9705/(4 × 2618) = .93.
This result is quite good, but as Hawkins et al.
point out, the slowest processor will determine the
overall computation time. If one or more of the
processors is busy with other jobs, then the computation
time could be much longer. For example, if one of the
processors can only devote 50% effort to the
computation, then the overall computation time will be close
to twice as long.
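
The efficiency figure can be checked with a few lines of arithmetic. The Python sketch below assumes that distributed efficiency is the perfectly distributed time (the single-processor time divided by the number of processors) divided by the observed distributed time; the function name `distributed_efficiency` is ours and does not come from Hawkins et al.

```python
def distributed_efficiency(serial_seconds, distributed_seconds, n_processors):
    """Perfectly distributed time divided by the observed distributed time."""
    perfect = serial_seconds / n_processors
    return perfect / distributed_seconds

# Median wall clock times quoted in the text: one SPARC-IPC versus four.
unloaded = distributed_efficiency(9705, 2618, 4)

# If the slowest of the four processors can give only 50% effort, a fixed
# equal split takes roughly twice as long, which halves the efficiency.
one_busy = distributed_efficiency(9705, 2 * 2618, 4)

print(round(unloaded, 2), round(one_busy, 2))
```

Under this definition a value of 1 would mean that four processors finish in exactly one quarter of the single-processor time.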

One solution to the problem of differing loads
on the CPUs used in distributing a computation is
to split the computation into many small parts and
then send the parts one at a time to processors as they
become available. In this way, slower processors get
fewer of the parts and the overall computation
time is likely to be significantly less than if larger
parts were sent to each processor as in Hawkins et
al. We therefore suggest the following modification to the
steps required to distribute a serial computation:

1.  Modify  the  serial  code  so  that  it  can  compute  any 
portion  of  the  total  computation. 

2.  Identify  which  processors  are  available  to  assist 
in  the  computation. 

3. Generate a large number of input files splitting
the computation into reasonably small parts.

4.  Construct  a  shell  script  that  sends  the  first  k  in¬ 
put  files  to  k  available  processors. 

5. As a processor finishes its computation, the
output is appended to an output file for that
processor and a new input file is sent to that processor.

6.  Collect  the  output  from  each  processor  and  re¬ 
port  the  final  solution. 
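
The effect of handing out small parts on demand can be illustrated with an idealized simulation. The Python sketch below is not the authors' shell script: the part cost, the relative processor speeds, and the function names `static_makespan` and `dynamic_schedule` are hypothetical, and communication overhead is ignored. It compares an equal split made up front with a queue that gives the next part to whichever processor becomes free first.

```python
import heapq

def static_makespan(n_parts, part_cost, speeds):
    """Equal split up front: every processor receives n_parts / k parts."""
    per_processor = n_parts / len(speeds)
    return max(per_processor * part_cost / s for s in speeds)

def dynamic_schedule(n_parts, part_cost, speeds):
    """Give parts out one at a time to the processor that is free first."""
    # Heap of (time at which the processor becomes free, processor index).
    free_at = [(0.0, i) for i in range(len(speeds))]
    heapq.heapify(free_at)
    counts = [0] * len(speeds)
    finish = 0.0
    for _ in range(n_parts):
        t, i = heapq.heappop(free_at)
        done = t + part_cost / speeds[i]
        counts[i] += 1
        finish = max(finish, done)
        heapq.heappush(free_at, (done, i))
    return finish, counts

# Hypothetical setup: 50 parts of 100 work units each, three processors
# at full speed and one that can give only half of its time.
speeds = [1.0, 1.0, 1.0, 0.5]
print(static_makespan(50, 100.0, speeds))   # 2500.0
print(dynamic_schedule(50, 100.0, speeds))  # (1500.0, [15, 14, 14, 7])
```

In this toy setting the half-speed processor ends up with 7 of the 50 parts, and the overall time is set by the total available capacity rather than by the slowest machine.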

Software that implements these steps is available
in the software package Chare (Kale, 1990), used by
Raphael Finkel at the University of Kentucky. The
disadvantage of Chare is that it is basically a
programming language that must be learned, and it runs
only on a very limited number of platforms.


Method    Median     Perfect            Distributed
          (sec)      Distribution (a)   Efficiency (b)

SG        2491 (c)   2426               97
HSS       2618 (d)   2426               93
1 CPU     9705

(a) median time on 1 CPU divided by 4
(b) ratio of perfect distribution to median distributed time, in percent
(c) n=12, mean=2525, s=65.3
(d) n=5, mean=2569, s=116

Table 1: Computation Times for 4 Sparc-IPCs

The Appendix to this paper contains a Bourne
shell script that can replace the file “distlms.sh” of
Hawkins et al. The only modifications that need to
be made to the other programs provided in Hawkins
et al. are as follows:

1. When prompted by “lmsd.f” for the workstation
names, respond with the names of the individual
parts of the computation, e.g., p1, p2, .... The
shell script requires that the input files have a
numbered naming convention. We recommend
parts of equal sizes.

2. The program calling.f and its subroutines (which
we refer to as “lmsr.f”) found in UNIX.PRG (see
Hawkins et al.) must be compiled for each
processor it will be executed on. Modify lmsr.f
so that it will print its output to a file called
“host.out” by adding after “READ(*,10) OUTFIL”
the line “OUTFIL = 'host.out'”. (Here host is
the name given to the machine in the host file
for the shell script, e.g. gani.out for the host
gani, brahms.out for the host brahms, ....) This
must be done for each host.

3. Make the changes needed to the shell script in the
Appendix for the user's network.

As  an  example,  we  partitioned  the  exact  compu¬ 
tation  of  the  LMS  estimate  for  “educat.dat”  discussed 
above  into  50  parts.  Table  1  contains  the  compu¬ 
tation  times  as  reported  in  Hawkins,  Simonoff  and 
Stromberg  (1994),  for  their  method  (HSS)  as  well  as 
results  for  the  new  Stromberg/Gardner  (SG)  method. 

Note that the new method has a better median
distributed efficiency. More importantly, note
the lower standard deviation of the runs for the new
method. The runs for Hawkins et al. are more
variable because their times are highly load
dependent, while the new method is less sensitive to
varying loads on the network.


As an example of how this new shell script takes
advantage of the processors that are not as heavily
loaded, the following test case was performed. A
numerically intensive program was executed on a Sun
Sparc-IPC (host name gani). The distributed
computation was performed using four Sparc-IPCs (gani,
brahms, bart, and utah). The other three processors
were relatively unloaded compared to gani, which
could dedicate only 50% of its processing time to the
calculations. The total computation time was 3104
seconds, but more interesting was the number of
individual parts of the computation that each machine
performed: of the 50 total, brahms did 13, utah did
15, bart did 14, and gani did only 8. If each of the
machines were equally loaded, then it would be expected
that each machine would do 25% of the computation.
In this case, gani had a workload that was twice as
much as the others, and accordingly it only performed
8/50 = 16% of the computation. Additionally, under
the old method (HSS), the total computation time
could be expected to be about 5236 (2*2618) seconds
because of the higher load on gani. Thus the new
method (SG) needed only about 60% of the computing
time expected under the HSS method (3104/5236 = .59).
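
The figures in this test case can be put side by side with a short calculation. The Python sketch below uses the part counts and times quoted in the text. The relative speeds (1.0 for the three unloaded hosts, 0.5 for gani) and the idea of measuring efficiency against 3.5 "full processors" are our assumptions for illustration, not part of the original analysis.

```python
# Relative speeds assumed for the test case: three unloaded hosts, gani at 50%.
speeds = {"brahms": 1.0, "utah": 1.0, "bart": 1.0, "gani": 0.5}
# Number of parts each host completed, as reported in the text.
parts_done = {"brahms": 13, "utah": 15, "bart": 14, "gani": 8}

capacity = sum(speeds.values())           # 3.5 "full processors"
total_parts = sum(parts_done.values())    # 50
expected_share = {h: s / capacity for h, s in speeds.items()}
observed_share = {h: n / total_parts for h, n in parts_done.items()}

for host in speeds:
    print(f"{host:7s} expected {expected_share[host]:.1%}"
          f"  observed {observed_share[host]:.1%}")

serial = 9705               # median single-CPU time (seconds)
observed_sg = 3104          # loaded run with the new script
projected_hss = 2 * 2618    # equal split, slowest host at half speed

# Fraction of the projected HSS time that the new method needed.
time_ratio = observed_sg / projected_hss
# Efficiency measured against the capacity that was actually available.
loaded_efficiency = (serial / capacity) / observed_sg
print(round(time_ratio, 2), round(loaded_efficiency, 2))
```

With these assumptions gani would be expected to complete about 14% of the parts, close to the 16% observed.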
Appendix


#!/bin/sh
# Don't delete this first line, it lets the system know which Unix shell
# to use when executing this file. We are using the Bourne shell, but
# other closely compatible shells could be used with minor modification.
# This is a UNIX shell script that will execute a distributed
# computation of the Least Median of Squares Regression Equation
# (Ref: "Distributing a Computationally Intensive Estimator: The
# Case of Exact LMS Regression", Computational Statistics (1994), by D.
# Hawkins, J. Simonoff, and A. Stromberg.) This is a modification of
# the previous method of distributing the simulation, which broke up the
# computation into several parts, one for each machine available, and
# executed the programs remotely.
#
# This method is very much like a multiple server, single line queueing
# system where the shell sends out small pieces of the computation to
# each of the machines, and as these machines become "available" (i.e.
# finish their computation), the shell will send a new job to the
# machine. The idea is to use the machines which are operating quicker
# more often than the slower ones.
#
# As with all shells, this file must be given execution privilege on
# your machine. This is done on most machines by the command:
# chmod +x filename
#
# Modifications made by Capt Sam Gardner, U.S. Air Force.
# June 1994

# Beginning of the Script
HOSTFILE=hostfile
# This variable holds the name of the file which contains the list
# of hosts/processors to use. Each hostname should be on a separate
# line of the file.

OUTFILE=outfile
# This variable holds the name of the file into which all of the output
# will be put.

NUMSENT=0
# Variable to count the number of jobs sent. The input files should
# have a numeric naming scheme, e.g. input1, input.1, input-1, etc...
# In this example (see the rsh command below), the input files are
# named e1, e2, e3, ..., e50

NUMTOTAL=50
# Variable which contains the total number of inputs/jobs to be executed.

# This checks to see if you created the file defined as HOSTFILE. If
# not, the shell informs you and exits.
if [ ! -f $HOSTFILE ]; then
    echo "Cannot find $HOSTFILE"
    echo "Create file $HOSTFILE with a list of hosts to use. Exiting shell"
    exit 1
fi

# This checks to see if the file defined as OUTFILE above exists. If so,
# the shell will ask if you want to delete it. If you select no, the shell
# will exit.
if [ -f $OUTFILE ]; then
    echo "The output file $OUTFILE exists. Delete it and continue? (Y or N)"
    read query
    if [ $query = "Y" ]; then
        rm $OUTFILE
    else
        exit 1
    fi
fi

# Opens an empty file with the name stored in OUTFILE
cat /dev/null > $OUTFILE

# This loop cleans up any files left over from a previous execution of
# this shell and sets up some flag files that the shell needs later on.
# Note that this requires a separate file called "falsefile" which
# contains a single word, FALSE. Later on a file called "truefile" will
# be needed also, and truefile should contain only the word TRUE.
for name in `cat $HOSTFILE`; do
    rm $name.out 2>/dev/null
    rm $name.sent 2>/dev/null
    # puts the word FALSE into the machine.sent file
    cat falsefile > $name.sent
    cat /dev/null > $name.job
done


# The following while loop will check to see first if NUMSENT is less
# than NUMTOTAL. It then checks to see if a machine/host in the list
# in HOSTFILE is busy. If not, it sends a remote job to that machine,
# increments NUMSENT by 1, stores the number of the input file, and puts
# the word TRUE in the host.sent file. If the host is busy, then it
# moves to the next machine. If a host has completed a job, the file
# host.out will exist and the shell will append the output to OUTFILE,
# remove host.out, and put the word FALSE into host.sent, letting the
# shell know that the host is now available to execute a job. Otherwise,
# the shell lets you know that the host is busy.
DIRECTORY=directory_where_the_executables_are
while [ "$NUMSENT" -lt "$NUMTOTAL" ]; do
    for name in `cat $HOSTFILE`; do
        hostflag=`cat $name.sent`
        if [ "$NUMSENT" -lt "$NUMTOTAL" ]; then
            if [ $hostflag = "FALSE" ]; then
                NUMSENT=`expr $NUMSENT + 1`
                rsh $name -n "cd "$DIRECTORY"; $name.lms < e$NUMSENT" &
                # Note that the executable files should all have a naming
                # pattern based on the host name. In this case, the
                # executables and input files are all in the same directory,
                # with the executable files having names "gani.lms",
                # "brahms.lms", etc... The variable DIRECTORY should be
                # changed to the working directory. If the input jobs are
                # named differently from e$NUMSENT then that variable should
                # be changed accordingly.
                rm $name.sent
                cat truefile > $name.sent
                rm $name.job
                echo "$NUMSENT" > $name.job
                echo "Job $NUMSENT sent to $name"
            elif [ -f $name.out ]; then
                echo "$name completed job, adding $name.out to $OUTFILE"
                rm $name.sent
                cat falsefile > $name.sent
                cat $name.out >> $OUTFILE
                rm $name.out
            else
                echo "job pending at $name"
            fi
        fi
    done
done

# Now all of the jobs have been sent and the shell will wait until
# they are all complete. It will tell you which input file it is
# waiting on, also, so if you are waiting a long time for one of the
# final jobs to complete, you can kill the shell and run the last
# job manually on a faster machine.
echo "All jobs have been sent, waiting for final jobs to complete"

for name in `cat $HOSTFILE`; do
    hostflag=`cat $name.sent`
    if [ $hostflag = "FALSE" ]; then
        echo "No jobs pending at host $name"
    else
        job=`cat $name.job`
        until [ -f $name.out ]; do
            echo "Waiting for $name to complete job $job"
            # If this is looping too fast and filling up the screen, the
            # following line can be used in the shell (just delete the
            # pound sign)
            # sleep 5
        done
        echo "$name completed final job, adding $name.out to $OUTFILE"
        cat $name.out >> $OUTFILE
        rm $name.out
    fi
done

# The following line executes the program that computes the LMS fit
# from the output file.
collect

# Finally the LMS fit is printed to the screen.
more lmsfit

# End of the shell
Abstract

In  this  paper  we  give  new  insights  into  why  the  problem 
of  detecting  multivariate  outliers  can  be  difficult  and  why 
the  difficulty  increases  with  the  dimension  of  the  data. 
We  then  describe  significant  improvements  in  methods 
for  detecting  outliers  and  demonstrate  using  extensive 
simulation  experiments  that  a  hybrid  method  extends 
the  practical  boundaries  of  outlier  detection  capabilities. 
Based  on  simulation  results,  we  investigate  the  question 
of  what  levels  of  contamination  can  be  detected  by  this 
algorithm  as  a  function  of  dimension,  computation  time, 
sample  size,  contamination  fraction,  and  distance  of  the 
contamination  from  the  main  body  of  data.  A  more 
detailed  presentation  on  this  topic  is  contained  in  Rocke 
and  Woodruff  (1994). 

1  Introduction 

While  methods  of  detection  of  sporadic  outliers  in  mul¬ 
tivariate  data  have  existed  for  many  years  (see  Hawkins 
1980),  the  problem  of  detecting  clusters  of  outliers  can 
be  extremely  difficult.  This  essentially  requires  robust 
estimation  of  multivariate  location  and  shape,  and  most 
estimators  for  the  latter  problem  are  known  to  fail  when 
the fraction of contamination is greater than 1/(p + 1),
where  p  is  the  dimension  of  the  data.  Thus  detecting 
outliers  or  a  disparate  population  that  compose  more 
than  a  small  fraction  of  the  data  has  been  impractical  in 
high  dimension. 

In this paper we give new insights into why the problem
of detecting multivariate outliers is so difficult and
why the difficulty increases with the dimension of the
data. We then describe significant improvements in
methods for detecting outliers and demonstrate using
extensive simulation experiments that a hybrid method
extends the practical boundaries of outlier detection
capabilities. Determination of the exact boundaries is
complicated by the fact that the probability of detecting
outliers depends on many things such as the computer time
expended, dimension, number of data points, fraction
of data contaminated, type of contamination, and
algorithm parameters. Nonetheless, based on simulations we
are able to specify approximately what levels of
contamination can be detected by this algorithm under a variety
of conditions.

The estimation of multivariate location and shape is
one of the most difficult problems in robust statistics
(Campbell 1980, 1982; Davies 1987; Devlin, Gnanadesikan,
and Kettenring 1981; Donoho 1982; Hampel et
al. 1986; Huber 1981; Lopuhaa 1989; Maronna 1976;
Rocke and Woodruff 1993; Rousseeuw 1985; Rousseeuw
and Leroy 1987; Stahel 1981; Tyler 1983, 1991). For
some statistical procedures, it is relatively straightforward
to obtain estimates that are resistant to a reasonable
fraction of outliers, for example, one-dimensional
location (Andrews et al. 1972) and regression with
error-free predictors (Huber 1981). The multivariate location
and shape problem is more difficult, since most known
methods will break down if the fraction of outliers is
larger than 1/(p + 1), where p is the dimension of the
data (Maronna 1976; Donoho 1982; Stahel 1981). This
means that, in high dimension, a very small fraction of
outliers can result in very bad estimates.
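
To give a sense of how quickly the 1/(p + 1) bound shrinks, the following Python sketch tabulates it for a few dimensions. The sample size of 100 and the function names `breakdown_bound` and `max_outliers` are illustrative choices of ours.

```python
from fractions import Fraction

def breakdown_bound(p):
    """The fraction 1/(p + 1) quoted in the text for dimension p."""
    return Fraction(1, p + 1)

def max_outliers(n, p):
    """Largest whole number of outliers among n points within the bound."""
    return int(n * breakdown_bound(p))  # truncates, i.e. floor for n > 0

for p in (2, 5, 10, 20):
    print(p, round(float(breakdown_bound(p)), 3), max_outliers(100, p))
```

For 100 observations the tolerable number of outliers falls from 33 in two dimensions to 4 in twenty dimensions.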

We are particularly interested in obtaining estimates
that are affine equivariant. A location estimator $t_n \in \mathbb{R}^p$
is affine equivariant if and only if for any vector $b \in \mathbb{R}^p$
and any non-singular $p \times p$ matrix $A$

$$t_n(AX + b) = A\,t_n(X) + b.$$

A shape estimator $C_n \in PDS(p)$ is affine equivariant if
and only if for any vector $b \in \mathbb{R}^p$ and any non-singular
$p \times p$ matrix $A$

$$C_n(AX + b) = A\,C_n(X)\,A^T.$$

This implies, for example, that stretching or rotating
measurement scales will not change the estimates.
Dropping the requirement of affine equivariance does increase
the number of estimators that are available, and there
may certainly be cases where a non-affine-equivariant
estimator provides superior performance, but it is also
important to have robust, computable, affine-equivariant
estimators available for use. In fact, though, we know of
no non-affine-equivariant estimator that can deal with
difficult outliers any better than the best of the
affine-equivariant methods.
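
Affine equivariance of a location and a shape estimator can be checked numerically. The Python sketch below uses the ordinary sample mean and sample covariance, which are affine equivariant but not robust, purely as a stand-in for the robust estimators discussed in this paper. The data, the matrix A, the vector b, and all helper names are made up for illustration.

```python
def mean(X):
    n = len(X)
    return [sum(x[j] for x in X) / n for j in range(len(X[0]))]

def cov(X):
    n, p = len(X), len(X[0])
    m = mean(X)
    return [[sum((x[i] - m[i]) * (x[j] - m[j]) for x in X) / (n - 1)
             for j in range(p)] for i in range(p)]

def matvec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

def affine(X, A, b):
    """Apply x -> A x + b to every data point."""
    return [[y + bi for y, bi in zip(matvec(A, x), b)] for x in X]

def flat(M):
    return [v for row in M for v in row] if isinstance(M[0], list) else list(M)

def close(U, V, tol=1e-9):
    return all(abs(u - v) <= tol for u, v in zip(flat(U), flat(V)))

X = [[1.0, 2.0], [2.0, 1.0], [4.0, 5.0], [3.0, 7.0]]  # four points, p = 2
A = [[2.0, 1.0], [0.0, 3.0]]                          # non-singular, det = 6
b = [10.0, -4.0]

Y = affine(X, A, b)
location_ok = close(mean(Y), [u + v for u, v in zip(matvec(A, mean(X)), b)])
shape_ok = close(cov(Y), matmul(matmul(A, cov(X)), transpose(A)))
print(location_ok, shape_ok)
```

Both checks hold for the sample mean and covariance. An estimator that is not affine equivariant, such as the coordinatewise median, would in general fail the location check for a matrix A that mixes coordinates.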

Computational methods have been reported in the
literature for a number of approaches for finding robust
estimates of multivariate location and shape (and therefore
identifying outliers). Combinatorial estimators, such as
the minimum volume ellipsoid (MVE) and minimum
covariance determinant (MCD) estimators of Rousseeuw
(1985; Hampel et al. 1986; Rousseeuw and Leroy 1987),
have been addressed with random search (Rousseeuw
and Leroy 1987: MinVol), steepest descent with
random restarts (Hawkins 1993a, 1993b: FSA), and
heuristic search optimization efforts (Woodruff and Rocke
1993a, 1993b). Iterative estimators such as maximum
likelihood and M-estimators (Campbell 1980, 1982;
Huber 1981; Kent and Tyler 1991; Lopuhaa 1992; Maronna
1976; Rocke 1992; Tyler 1983, 1988, 1991), and
S-estimators (Davies 1987; Hampel et al. 1986; Lopuhaa
1989; Rousseeuw and Leroy 1987) can be computed with
a straightforward iteration from a good starting point
(Rocke and Woodruff 1993) or using an ad hoc search for
the global minimum (Ruppert 1992: Surreal).
Sequential point addition estimators (Forward) have been
defined algorithmically by Atkinson (1992) and Hadi
(1992) working separately. The Hadi paper suggests
the use of a non-affine-equivariant starting point, but
the point addition portion of the algorithm is
affine-equivariant and is nearly the same as the point
addition portion of Atkinson’s completely affine-equivariant
algorithm.

In  the  remainder  of  the  paper,  we  discuss  that  nature 
of  multivariate  outliers,  with  a  special  view  to  what  sorts 
of  outliers  are  worth  studying.  We  show  that  outliers 
that  have  the  same  shape  as  the  main  data  are  in  some 
sense  the  hardest  to  find,  and  that  the  more  compact 
the  outliers  are,  the  harder  they  are  to  find.  We  adopt 
shift  outliers  as  a  reasonable  target,  being  of  the  hardest 
shape,  but  of  a  feasible  size  to  locate.  Then  we  study 
the comparative performance of our new hybrid algorithm
and previous methods such as MinVol, FSA,
and  Forward,  demonstrating  the  superiority  of  the  new 
method.  Then  we  investigate  the  question  of  what  prob¬ 
lems  can  be  practically  tackled  with  our  methods. 


2  The  Nature  of  Multivariate 
Outliers 

In  this  section,  we  investigate  the  difficulties  of  locating 
multivariate  outliers.  First,  to  frame  the  problem  as  this 
paper  deals  with  it,  we  assume  that  there  is  a  fraction 
greater than one-half of the data that come from a well-behaved
multivariate population, for example multivariate
normal. Of course, in practical cases, data transformations
may be required before this plausibly holds.
In  addition  to  the  well-behaved  data,  there  are  other 
data  that  do  not  fit  the  pattern  of  this  well-behaved 
majority — they  may  arise  from  a  distinct  population,  or 
may  be  measurement  errors;  all  that  is  required  is  that 
the  pattern  of  these  data  points  is  different  from  the  re¬ 
mainder.  We  will  sometimes  refer  to  the  majority  of  the 
data  that  come  from  that  well-behaved  population  as 
the  good  data,  and  the  remainder  as  the  bad  data.  There 
is  supposed  to  be  no  implication  that  the  bad  data  are 
necessarily  errors — they  may  just  arise  from  a  distinct 
sub-population — but  the  locution  is  convenient. 

A  second  aspect  of  our  viewpoint  on  this  problem  is 
that  we  aspire  to  methods  that  are  affine  equivariant,  so 
that  measurement  scale  changes  or  other  linear  transfor¬ 
mations  do  not  alter  the  behavior  of  analysis  methods. 
An implication of this viewpoint is that Mahalanobis
distances  become  very  important,  since  these  are  among 
the  few  potentially  affine-equivariant  outlier  identifica¬ 
tion  criteria. 

Definition 1 Let $\Omega$ be a positive definite symmetric $p \times p$
matrix. The Mahalanobis distance between points $x$
and $y$ in $\mathbb{R}^p$ with respect to $\Omega$ is defined by

$$d_\Omega^2(x, y) = (x - y)^T \Omega^{-1} (x - y). \qquad (1)$$

We  refer  to  the  distance  and  the  matrix  that  defines  it 
interchangeably  as  a  metric. 
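
As an illustration of Definition 1, the following minimal sketch computes the squared distance and checks numerically that it is unchanged when the points are transformed by a matrix $A$ and the metric by $A \Omega A^T$. The function name `mahalanobis_sq` and the example matrices are illustrative choices, not taken from the paper.

```python
import numpy as np

def mahalanobis_sq(x, y, omega):
    """Squared Mahalanobis distance (x - y)^T omega^{-1} (x - y)."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    # Solve omega z = diff rather than forming the inverse explicitly.
    return float(diff @ np.linalg.solve(omega, diff))

# Affine equivariance check with arbitrary (hypothetical) example values.
omega = np.array([[2.0, 0.5], [0.5, 1.0]])
A = np.array([[3.0, 1.0], [0.0, 2.0]])
x, y = np.array([1.0, 2.0]), np.array([-1.0, 0.5])
d_before = mahalanobis_sq(x, y, omega)
d_after = mahalanobis_sq(A @ x, A @ y, A @ omega @ A.T)
```

The two values agree up to rounding, which is why distances of this form are natural criteria for affine-equivariant methods.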

For  data  like  those  we  consider  here,  the  true  metric 
is  the  covariance  matrix  of  the  population  from  which 
the  good  data  arise;  a  good  metric  is  one  which  is  close 
to  the  true  metric.  In  particular,  when  the  covariance  of 
the  whole  sample  differs  by  a  lot  from  the  covariance  of 
the  good  data,  a  good  metric  is  one  that  resembles  the 
latter  rather  than  the  former. 

We  will  find  it  convenient  to  distinguish  the  size  and 
shape  of  a  metric  as  follows: 

Definition 2 Let $\Omega$ be a matrix defining a metric. The
size of the metric is the determinant $|\Omega|$. The shape
of the metric is the equivalence class of metrics $\Xi$ such
that $\Omega/|\Omega|^{1/p} = \Xi/|\Xi|^{1/p}$. Equivalently, we may identify
the shape as the member of the equivalence class with
determinant 1; that is, $\Omega/|\Omega|^{1/p}$.

This leads to similar definitions of shape and size for samples.

Definition 3 Let $X$ be an $n \times p$ matrix representing a
sample of $n$ points in $\mathbb{R}^p$. Let $S = n^{-1}(X - \bar{X})^T (X - \bar{X})$
be the sample covariance matrix. The size or scale of $X$
is the determinant $|S|$ of its covariance matrix, and the
shape of $X$ is $S/|S|^{1/p}$. By extension, we refer to the size
and shape of other covariance-like estimators, such as
the robust ones that are the subject of this paper.

We  now  consider  the  question  of  what  kind  of  outliers 
are  hard  to  find.  We  begin  by  examining  the  case  in 
which  a  good  metric  is  available.  This  is  the  goal  of  most 
affine-equivariant outlier identification methods: find a
good  metric  so  that  the  outliers  will  reveal  themselves. 

Lemma 1 Consider a sample of $n$ points in $\mathbb{R}^p$. Let the
“good” data have mean $\mu_0$ and covariance $\Sigma_0$. Let the
“bad” data have mean $\mu_0 + \mu$ and covariance matrix $\Omega$,
and let this comprise a fraction $\alpha$ of the overall data.
Then the expected sample mean and covariance matrix
are as follows:

$$E(\bar{x}) = \mu_0 + \alpha\mu \qquad (2)$$

$$E(S) = (1 - \alpha)\Sigma_0 + \alpha\Omega + \alpha(1 - \alpha)\mu\mu^T \qquad (3)$$

Proof. See Rocke and Woodruff (1994). □
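
The moments in Lemma 1 can be checked by simulation. The sketch below assumes the standard two-component mixture formulas, mean $\mu_0 + \alpha\mu$ and covariance $(1-\alpha)\Sigma_0 + \alpha\Omega + \alpha(1-\alpha)\mu\mu^T$, and compares them with a large simulated sample. The parameter values and the helper name `mixture_moments` are illustrative assumptions.

```python
import numpy as np

def mixture_moments(mu0, sigma0, mu, omega, alpha):
    """Mean and covariance of the good/bad mixture (assumed standard formulas)."""
    mean = mu0 + alpha * mu
    cov = ((1 - alpha) * sigma0 + alpha * omega
           + alpha * (1 - alpha) * np.outer(mu, mu))
    return mean, cov

rng = np.random.default_rng(0)
p, n, alpha = 3, 200_000, 0.2
mu0, sigma0 = np.zeros(p), np.eye(p)
mu = np.array([3.0, 0.0, -1.0])       # hypothetical displacement
omega = 0.5 * np.eye(p)               # hypothetical outlier covariance

n_bad = int(round(alpha * n))
good = rng.multivariate_normal(mu0, sigma0, size=n - n_bad)
bad = rng.multivariate_normal(mu0 + mu, omega, size=n_bad)
X = np.vstack([good, bad])

mean_theory, cov_theory = mixture_moments(mu0, sigma0, mu, omega, alpha)
mean_err = np.abs(X.mean(axis=0) - mean_theory).max()
# bias=True uses the divisor n, matching Definition 3.
cov_err = np.abs(np.cov(X, rowvar=False, bias=True) - cov_theory).max()
```

The rank-one term $\alpha(1-\alpha)\mu\mu^T$ is what inflates the full-sample covariance along the direction of the shift, and it is the source of masking.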

Theorem 1 Consider a sample of $n$ points in $\mathbb{R}^p$. Let
the “good” data be multivariate normal with mean $\mu_0$
and covariance $\Sigma_0$. Let the “bad” data be multivariate
normal with mean $\mu_0 + \mu$ and covariance matrix $\Omega$.
Consider the Mahalanobis square distance $d_{\Sigma_0}^2(x, \mu_0)$ of
a point from the true mean using the true metric. Then,
for a fixed location displacement $\mu$ and size $|\Omega|$ of the
outliers, the expectation of the Mahalanobis square distance
of a bad point from the true mean is least when the
shape of $\Omega$ is the same as the shape of $\Sigma_0$. This is thus
the worst case from a detection point of view.

Proof. See Rocke and Woodruff (1994). □

The  above  theorem  implies  that  the  hardest  kind  of 
outliers  to  find,  when  a  good  metric  is  available,  is  the 
kind  that  have  a  covariance  matrix  with  the  same  shape 
as  the  good  data.  For  this  situation,  this  reduces  the 
infinitely  variable  kinds  of  outliers  to  a  single  kind.  If 
this  kind  of  outlier  can  then  be  detected,  so  should  other 
kinds.  We  intend  therefore  to  focus  on  a  situation  in 
which  there  are  good  data  drawn  from  a  multivariate 
normal  distribution,  and  bad  data  drawn  from  the  same 
distribution  and  then  displaced.  These  are  often  called 
shift  outliers  (Hawkins  1980;  Rocke  and  Woodruff  1993). 

Shift outliers may be contrasted with classes of outliers
that may be easy to detect, in the sense of appearing
disparate even with the bad metric obtained by using all
the data. For easily detected outliers, no fancy robust
techniques are required; merely examining the Mahalanobis
distances from the mean of the data using the
covariance  matrix  of  the  data  will  suffice.  While  we  have 
seen  that  the  shape  for  bad  data  that  maximizes  their 
masking  is  the  shape  of  the  good  data,  we  have  not  yet 
addressed  the  issue  of  size.  The  next  theorem  shows  how 
easy  detection  is  a  consequence  of  the  number  and  size 
of  the  contamination. 

Theorem 2 Consider a sample of $n$ points in $\mathbb{R}^p$. Let the
“good” data be multivariate normal with mean $\mu_0$ and
covariance $\Sigma_0$. Let the “bad” data be multivariate normal
with mean $\mu_0 + \mu$ and covariance matrix $\Omega = \lambda\Sigma_0$,
and let this comprise a fraction $\alpha$ of the overall data. Let
$S$ be the expected covariance matrix of the mixed sample
as above and consider $d_S^2(x, \alpha\mu)$, the Mahalanobis
square distance in the bad metric between a data point $x$
and the overall population mean. Then

1. The difference in the value of $E(d_S^2(x, \alpha\mu))$ for a
bad point and the value for a good point for large $\eta$
is an increasing function of $\lambda$, so that $\lambda = 0$ is the
worst case.

2. If $\lambda = 0$, so that the outliers form a point mass,
and if $\eta$ is large, then the value of $E(d_S^2(x, \alpha\mu))$
for a bad point is less than the value for a good point
whenever $\alpha > 1/(p + 1)$.

3. If $\lambda = 1$ (pure shift outliers), and if $\eta$ is large, then
the value of $E(d_S^2(x, \alpha\mu))$ for a bad point is always
larger than the value for a good point. However, for
large $p$, the distribution of the distance of a good
point and the distribution of the distance of a bad
point converge.

4. For large $\eta$, the value of $\lambda$ at which $E(d_S^2(x, \alpha\mu))$
has the same value for good points and bad points is

$$\lambda = \frac{(1 - \alpha)(\alpha p - (1 - \alpha))}{\alpha((1 - \alpha)p - \alpha)}$$

whenever this is positive.

Proof. See Rocke and Woodruff (1994). □
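
The large-$\eta$ statements of Theorem 2 can be explored numerically. The sketch below is an illustration under stated assumptions: it takes $\Sigma_0 = I$ and $\mu_0 = 0$ (harmless under affine equivariance), reads $\eta$ as the squared length $\mu^T \Sigma_0^{-1} \mu$ of the shift (an assumed interpretation, since $\eta$ is not defined in this section), and uses the mixture covariance $(1-\alpha)\Sigma_0 + \alpha\lambda\Sigma_0 + \alpha(1-\alpha)\mu\mu^T$ as the bad metric. The function names are hypothetical.

```python
import numpy as np

def expected_d2(p, alpha, lam, eta):
    """Expected squared distances (good, bad) from the mixture mean in the bad metric.

    Assumes Sigma_0 = I, mu_0 = 0, Omega = lam * I and mu^T mu = eta.
    """
    mu = np.zeros(p)
    mu[0] = np.sqrt(eta)
    S = ((1 - alpha) + alpha * lam) * np.eye(p) \
        + alpha * (1 - alpha) * np.outer(mu, mu)
    S_inv = np.linalg.inv(S)
    quad = mu @ S_inv @ mu
    # E[(x - c)^T A (x - c)] = tr(A Cov(x)) + (E x - c)^T A (E x - c)
    good = np.trace(S_inv) + alpha**2 * quad
    bad = lam * np.trace(S_inv) + (1 - alpha) ** 2 * quad
    return good, bad

def critical_lambda(p, alpha):
    """Value of lambda given in item 4 of Theorem 2."""
    return ((1 - alpha) * (alpha * p - (1 - alpha))
            / (alpha * ((1 - alpha) * p - alpha)))

p, alpha, eta = 10, 0.3, 1e6
lam_star = critical_lambda(p, alpha)
good_star, bad_star = expected_d2(p, alpha, lam_star, eta)
good0, bad0 = expected_d2(p, alpha, 0.0, eta)   # point-mass outliers
good1, bad1 = expected_d2(p, alpha, 1.0, eta)   # pure shift outliers
```

With these values the good and bad expectations coincide at `lam_star`, the bad points look closer than the good ones at $\lambda = 0$ (here $\alpha > 1/(p+1)$), and they look farther at $\lambda = 1$, in line with items 2 to 4.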

Remark 1 If a good starting estimate for the shape of
the good data can be found, then the hardest kind of contamination
to discover is that which has the same shape
as the good data. Since substantial contamination can
only be found by constructing a relatively good shape estimate,
this is the most difficult case for such search methods.

Remark 2 Although point-mass contamination is the
most difficult to detect by the Mahalanobis distance from
the sample mean, it is easy to detect in other ways, such
as pair-wise distances.

Remark  3  Although  pure  shift  outliers  might  seem  to  be 
detectable,  given  that  their  mean  Mahalanobis  distance 
from  the  sample  mean  is  larger  than  that  of  the  good 
points,  no  method  is  known  that  can  find  the  outliers 
with  complete  assurance.  This  is  because  the  overlap  in 
the  distributions  is  very  substantial. 

As  we  shall  see  later,  pure  shift  outliers  are  sufficient 
to  baffle  previously  proposed  methods  like  the  random 
search  algorithm  in  the  program  MinVol  (Rousseeuw 
1985).  Others  like  those  proposed  by  Hawkins  (1992) 
and  Atkinson  (1993)  turn  out  to  be  better  than  random 
search.  The  method  proposed  in  this  paper,  however, 
dominates  all  other  methods  examined  in  high  dimen¬ 
sion. 

Because  we  are  mainly  interested  in  high  dimension, 
we  will  rely  primarily  on  extensive  computational  exper¬ 
iments  to  compare  methods,  rather  than  the  standard, 
low-dimensional  examples  often  used  in  the  literature. 
However,  we  did  examine  the  performance  of  the  code 
on  some  of  these  standard  examples,  such  as  the  data 
of  Hawkins,  Bradu,  and  Kass  (1984),  achieving  the  ex¬ 
pected  outcomes.  For  the  reasons  outlined  in  this  sec¬ 
tion,  the  experiments  involve  mainly  pure  shift  outliers, 
although  a  few  other  cases  were  examined  to  check  for 
any  sensitivity  to  this  specification.  Dimensions  as  large 
as  50  were  examined,  even  though  the  computation  times 
can  rise  rapidly  with  the  dimension,  so  that  high  dimen¬ 
sional  cases  would  be  represented.  Previously,  the  liter¬ 
ature  has  concentrated  almost  exclusively  on  dimensions 
less  than  10,  and  usually  no  larger  than  five.  Methods 
that  appear  satisfactory  for  a  problem  with  three  dimen¬ 
sions  and  20  data  points  can  be  completely  impractical 
for  even  somewhat  larger  problems  (Woodruff  and  Rocke 
1993a).  We  examine  a  range  of  contamination  fractions 
from 1/(p + 1), which is the smallest non-trivial amount
of  contamination,  to  40%  or  45%,  which  can  be  almost 
impossible  to  find.  There  is  a  theoretical  limit  on  the 
number  of  contaminated  points  that  could  be  found  even 
in  principle;  the  number  of  good  points  must  be  at  least 
$h = (n + p + 1)/2$ (Lopuhaa and Rousseeuw 1991). The
good data are defined to be multivariate standard normal
and the bad data to be multivariate unit normal
with a shifted mean. We measure the amount of shift
in terms of the unit of measurement $Q_p = \sqrt{\chi^2_{p;0.001}}$,
which is more or less the radius of the sphere around
the mean that contains almost all the good points. If
the outliers are centered at a distance of $2Q_p$, then these
spheres should not overlap. We implement outliers at
a distance of $dQ_p$ by adding $dQ^*$ to each component,
where $Q^* = \sqrt{\chi^2_{p;0.001}/p}$. This places the outliers at the
correct distance out along a diagonal. In the experiments used
in this paper, we use $d = 2$, which we call close outliers,
and $d = 4$, which we call far outliers.
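
The generation mechanism just described can be sketched as follows. This is a minimal illustration, not the code used in the study: the function name `shift_outlier_sample` is hypothetical, the radius $Q_p$ is passed in as `q_p` so that any chi-square quantile routine can supply it, and the value 5.44 assumes that $\chi^2_{p;0.001}$ denotes the upper 0.001 point (about 29.59 for $p = 10$).

```python
import numpy as np

def shift_outlier_sample(n, p, alpha, d, q_p, rng):
    """Standard-normal good data plus unit-covariance outliers shifted by d * q_p.

    Returns the n x p data matrix and a boolean mask marking the bad rows.
    """
    n_bad = int(round(alpha * n))
    X = rng.standard_normal((n, p))
    is_bad = np.zeros(n, dtype=bool)
    is_bad[:n_bad] = True
    # Adding d * Q* to every coordinate moves the centre a distance
    # d * Q* * sqrt(p) = d * Q_p along the main diagonal.
    q_star = q_p / np.sqrt(p)
    X[is_bad] += d * q_star
    return X, is_bad

rng = np.random.default_rng(1)
p, q_p = 10, 5.44            # q_p: assumed approximate value of Q_p for p = 10
X, is_bad = shift_outlier_sample(n=100, p=p, alpha=0.3, d=2, q_p=q_p, rng=rng)
shift_length = np.linalg.norm(np.full(p, 2 * q_p / np.sqrt(p)))
```

Here `shift_length` equals $2Q_p$, confirming that the per-component shift $dQ^*$ places the outlier centre at the intended distance.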

This generation mechanism is sufficient for use
with affine-equivariant methods, but for non-affine-equivariant
methods, the data should then be standardized
so that the entire sample has mean 0 and covariance
$I$. This can be accomplished using the singular value decomposition
as follows. Let $S$ be the covariance matrix
of the whole sample of good and bad data. This can
be written as $S = Q^T D Q$, where $Q$ is an orthogonal
matrix and $D$ is the diagonal matrix of eigenvalues. If
$X$ is the sample, centered at its mean, then the sample
$X Q^T D^{-1/2}$ has the desired properties.
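
A minimal sketch of this standardization, assuming the covariance uses the divisor $n$ as in Definition 3 and that the sample is centered first. The function name `standardize` and the test data are illustrative.

```python
import numpy as np

def standardize(X):
    """Transform X so the whole sample has mean 0 and covariance I."""
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / X.shape[0]             # covariance with divisor n
    eigvals, eigvecs = np.linalg.eigh(S)   # S = eigvecs @ diag(eigvals) @ eigvecs.T
    # With Q = eigvecs.T, the matrix Q^T D^{-1/2} is eigvecs with
    # column j divided by sqrt(eigvals[j]).
    return Xc @ (eigvecs / np.sqrt(eigvals))

rng = np.random.default_rng(2)
mixing = np.array([[2.0, 0.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0, 0.0],
                   [0.0, 0.0, 3.0, 0.0],
                   [0.0, 1.0, 0.0, 0.5]])
X = rng.standard_normal((200, 4)) @ mixing
Z = standardize(X)
cov_Z = Z.T @ Z / Z.shape[0]
```

Because $S$ is symmetric and positive definite, its eigendecomposition coincides with its singular value decomposition, so `np.linalg.eigh` is used here for simplicity.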

One  convenient  aspect  of  the  use  of  shift  outliers  in 
this  problem  is  that  iterative  methods  such  as  M-  and  S- 
estimation  usually  have  at  most  two  roots:  one  that  can 
be  found  by  iterating  from  the  good  data  (the  good  root) 
and  one  that  occurs  when  iterating  from  all  the  data  (the 
bad  root).  For  small  amounts  of  contamination,  these 
may  not  be  distinct,  but  only  when  they  do  differ  is  the 
problem  interesting. 

Finally, we define the criterion of success for an outlier
detection method. If the method yields a location $\hat\mu$
and a metric $\hat\Sigma$, then the method is successful if the
largest value of $d_{\hat\Sigma}^2(x, \hat\mu)$ for a good point is smaller
than the smallest value for a bad point. This is a very
strict  criterion,  but  some  experimentation  has  suggested 
that  the  ordering  of  the  methods  is  not  changed  by  use 
of  a  looser  criterion.  With  pure  shift  outliers  shifted  by 
dQp,  this  is  essentially  always  possible  if  d  >  2  and  if  the 
metric  is  a  good  one. 
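
The success criterion amounts to complete separation of the two groups by distance. A minimal sketch, with a hypothetical function name and made-up distances:

```python
def is_success(sq_distances, is_bad):
    """True if the largest good distance is below the smallest bad distance."""
    good = [d for d, flag in zip(sq_distances, is_bad) if not flag]
    bad = [d for d, flag in zip(sq_distances, is_bad) if flag]
    if not good or not bad:
        return True  # nothing to separate (a convention chosen for this sketch)
    return max(good) < min(bad)

# Made-up squared distances: the last two points are the outliers.
separated = is_success([0.5, 1.2, 9.0, 7.5], [False, False, True, True])
masked = is_success([0.5, 8.0, 9.0, 7.5], [False, False, True, True])
```

In the second call one good point lies farther out than an outlier, so the strict criterion reports failure even though most points are ordered correctly.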

3  Affine-Equivariant  Methods 
for  Outlier  Detection 

All  known  methods  for  this  problem  consist  of  the  fol¬ 
lowing  three  steps: 

1.  Estimate  a  location  and  metric. 

2.  Scale  the  metric  so  that  it  agrees  on  some  calibrating 
distribution. 

3. Reject as outliers points whose Mahalanobis distances
from the location estimate are sufficiently
large.

The  last  two  steps  are  not  difficult,  so  the  essence  of  the 
problem comes down to highly resistant estimation of
multivariate location and shape. All methods for this
problem  known  to  us  come  down  to  such  an  estimation 
problem.  These  methods  fall  into  two  classes:  combina¬ 
torial  and  iterative.  Combinatorial  estimators  construct 
estimates  of  location  and  shape  from  a  subset  of  the  data 
which  itself  is  hoped  to  be  at  least  mostly  outlier-free.  It¬ 
erative  estimators  attempt  to  satisfy  a  continuous  equa¬ 
tion  by  iteration  from  a  starting  point.  Unless  iteration 
from the whole sample mean and covariance suffices
(an uninteresting case), this requires either direct search
or  use  of  a  prior  combinatorial  estimator  as  a  starting 
point. 

Our  point  of  comparison  is  the  random  search 
algorithm  MinVol  for  the  MVE  (Rousseeuw  1985; 
Rousseeuw and Leroy 1987; Rousseeuw and van
Zomeren  1990,  1991).  Until  very  recently,  this  was  effec¬ 
tively  the  state  of  the  art. 

Our  proposed  method  is  outlined  below;  the  rest  of 
this  section  is  devoted  to  describing  the  steps  in  more 
detail  and  to  comparing  the  method  to  those  in  the  pre¬ 
vious  literature.  We  will  refer  to  the  complete  method  as 
the  hybrid  algorithm  because  it  uses  both  combinatorial 
and  iterative  features,  as  well  as  incorporating  several 
other  useful  heuristics. 

1.  Randomize  the  order  of  the  data  points. 

2.  Partition  the  data  into  cells  indexed  by  j. 

3.  For  each  cell, 

(a) Spend $T/\lfloor n/\gamma(p) \rfloor$ seconds on a Tabu Search
for the MCD (Woodruff and Rocke 1993b).

(b) Use the MCD estimate as a starting point for a
sequential point addition algorithm using the
entire sample of size n starting from the p + 1
points  that  have  the  smallest  distance  from  the 
MCD  location  using  the  MCD  metric. 

(c) Use this result as the starting point for translated
bi-weight M-estimation (Rocke 1993) using
the entire sample of size n. This yields estimates
$\hat\mu_j$ and $\hat\Sigma_j$ of location and shape.

4. Select the index j for which $|\hat\Sigma_j|$ is least and set
$\hat\mu = \hat\mu_j$ and $\hat\Sigma = \hat\Sigma_j$.

5. Resize $\hat\Sigma$ so that the median distance is consistent
with an assumed (e.g., normal) distribution; that is,
multiply by $m/\chi^2_{p;h/n}$, where m is the hth largest
Mahalanobis square distance using the metric $\hat\Sigma$.

6. Reject as outliers those points whose Mahalanobis
distances exceed a chosen $\chi_p$ quantile.
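
The last two steps of the outline can be sketched as follows. This is an illustration only: the chi-square quantile is obtained from the Wilson-Hilferty approximation (not exact, chosen to avoid extra dependencies), $m$ is taken as the $h$-th order statistic of the squared distances in ascending order with $h = \lfloor (n + p + 1)/2 \rfloor$, and the 0.999 rejection level is a hypothetical choice. The function names are not from the paper.

```python
import numpy as np
from statistics import NormalDist

def chi2_quantile(p, prob):
    """Wilson-Hilferty approximation to the chi-square quantile (not exact)."""
    z = NormalDist().inv_cdf(prob)
    return p * (1 - 2 / (9 * p) + z * np.sqrt(2 / (9 * p))) ** 3

def rescale_and_reject(X, mu_hat, sigma_hat, cutoff_prob=0.999):
    """Calibrate the size of the metric, then flag points with large distances."""
    n, p = X.shape
    h = (n + p + 1) // 2
    diff = X - mu_hat
    d2 = np.einsum("ij,ij->i", diff @ np.linalg.inv(sigma_hat), diff)
    m = np.sort(d2)[h - 1]                 # h-th order statistic (assumed reading)
    scale = m / chi2_quantile(p, h / n)
    sigma_scaled = scale * sigma_hat       # rescaled metric
    d2_scaled = d2 / scale                 # distances in the rescaled metric
    return sigma_scaled, d2_scaled > chi2_quantile(p, cutoff_prob)

rng = np.random.default_rng(3)
X = rng.standard_normal((200, 5))
X[:20] += 10.0                             # 20 far outliers in rows 0..19
good = X[20:]
mu_hat = good.mean(axis=0)
sigma_hat = 4.0 * np.cov(good, rowvar=False)   # right shape, deliberately wrong size
sigma_scaled, flagged = rescale_and_reject(X, mu_hat, sigma_hat)
sigma_scaled_alt, _ = rescale_and_reject(X, mu_hat, 0.25 * sigma_hat)
```

The rescaled metric does not depend on the arbitrary size of the input estimate, which is why only the shape needs to be estimated robustly in the earlier steps.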


3.1 M- and S-Estimation

An S-estimate of multivariate location and shape is defined
as that vector $t$ and PDS matrix $C$ which minimize
$|C|$ subject to

$$n^{-1} \sum_{i=1}^{n} \rho\left(\left[(x_i - t)^T C^{-1} (x_i - t)\right]^{1/2}\right) = b_0 \qquad (4)$$

which we write as

$$n^{-1} \sum_{i=1}^{n} \rho(d_i) = b_0. \qquad (5)$$

It has been shown by Lopuhaa (1989) and, using a different
method, by Rocke (1993), that S-estimators are in
the class of M-estimators with standardizing constraints
with weight functions $v_1(d) = w(d)$, $v_2(d) = p\,w(d)$,
$v_3(d) = v(d)$, where $\psi(d) = \rho'(d)$, $w(d) = \psi(d)/d$,
$v(d) = \psi(d)d$, with constraint (5).
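
To make the functions $\rho$, $\psi$, $w$ and $v$ concrete, the sketch below writes them out for the ordinary Tukey biweight with cutoff $c$. This is the standard untranslated biweight, shown only as an example; it is not the translated biweight of Rocke (1993), and the cutoff and the constant $b_0$ are left as free parameters.

```python
def rho(d, c):
    """Tukey biweight rho; constant at c**2 / 6 beyond the cutoff c."""
    if abs(d) > c:
        return c**2 / 6
    return d**2 / 2 - d**4 / (2 * c**2) + d**6 / (6 * c**4)

def psi(d, c):
    """psi = rho', which vanishes beyond the cutoff."""
    return d * (1 - (d / c) ** 2) ** 2 if abs(d) <= c else 0.0

def w(d, c):
    """w(d) = psi(d) / d, written out to avoid dividing by zero at d = 0."""
    return (1 - (d / c) ** 2) ** 2 if abs(d) <= c else 0.0

def v(d, c):
    """v(d) = psi(d) * d."""
    return psi(d, c) * d

def constraint_gap(distances, c, b0):
    """Left side of equation (5) minus b0; zero when the constraint holds."""
    return sum(rho(d, c) for d in distances) / len(distances) - b0

# Central difference of rho at d = 1 with c = 3, to compare with psi(1, 3).
slope = (rho(1 + 1e-6, 3.0) - rho(1 - 1e-6, 3.0)) / 2e-6
```

Points with $d > c$ receive weight zero, which is what gives estimators of this type their resistance to distant outliers.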

In Rocke (1993) it is shown that S-estimators in high
dimension can be sensitive to outliers even if the breakdown
point is set to be near 50%. We utilize the translated
biweight (or t-biweight) M-estimation method defined
in Rocke (1993), with a standardization step consisting
of equating the median of $\rho(d_i)$ with the median
under normality. This is then not an S-estimate, but is
instead a constrained M-estimate.

In accord with the theory in Rocke (1993), we have
found that the use of the t-biweight M-estimator makes
a large improvement in the performance of the hybrid
algorithm compared to the use of biweight S-estimation,
at least when the outliers lie relatively close in ($d = 2$).
When $d = 4$, use of one iterative estimation method or
the other made no important difference. Some detailed
evidence is given in Table 1. The setting here is
twenty replicates of shift outliers at $d = 2$ with the indicated
sample size, fraction of outliers, and computation
time allowed (all computation times are CPU seconds
on a DECStation 5000/200). The response is the percentage
of replicates for which the indicated estimator
achieved the good root. Note that the t-biweight performance
exceeds that of the biweight S-estimate by large
amounts in every case. A large number of additional
experiments confirm this important difference in performance.

3.2  Partitioning 

The  simple  iteration  scheme  for  M-estimation  fails  with¬ 
out  a  good  starting  point.  An  M-estimator  that  begins 
iteration  using  an  estimate  based  on  all  the  data  breaks 
down with 1/(p + 1) of the data contaminated (Maronna
1976).  Two  methods  of  addressing  this  problem  seem 
possible. One is to look directly for the global minimizer
of the S criterion. The other is to find a good starting
point for the iteration by use of a preliminary combinatorial
estimator.

Table 1: Comparison of biweight S-estimation with t-biweight
M-estimation. The columns headed “%” are
the percentage of 20 trials in which the given estimator
correctly identified the outliers.

   n     α    time (sec)   biweight %   t-biweight %
  50   0.30       22            5            50
  50   0.30      202            5            70
  50   0.35       22            0            20
  50   0.35      202            0            25
 200   0.30       60           55            95
 200   0.30      240           55            95
 200   0.35       60            0            35
 200   0.35      240            0            55

Ruppert  (1992)  proposed  an  algorithm  called  Sur¬ 
real  for  direct  search  for  the  global  minimizer  of  an 
S estimator used in multiple regression. He reported
computational experiments that demonstrated the effectiveness
of Surreal for this purpose. In the same
paper,  he  proposed  an  extension  of  the  method  to  ro¬ 
bust  estimation  of  multivariate  location  and  shape.  It 
appears  Surreal  is  not  as  effective  for  this  problem  as 
for  regression.  In  dimension  10,  Surreal  rarely  found 
the  good  root  when  the  fraction  of  contamination  was 
greater  than  about  12%.  Since  this  was  not  competitive 
with  other  algorithms  examined,  detailed  results  are  not 
presented. 

We  also  have  examined  direct  search  as  a  method  of 
finding the good root for S- or M-estimation and have
found  that  it  seems  superior  to  use  a  preliminary  combi¬ 
natorial  estimator  such  as  the  MCD  (Rousseeuw  1985). 
As  pointed  out  by  Woodruff  and  Rocke  (1993b),  the  use 
of  the  MCD  to  find  a  good  starting  point  presents  se¬ 
vere  computational  difficulties.  Regardless  of  which  al¬ 
gorithms  are  used  to  compute  them,  combinatorial  esti¬ 
mators  such  as  the  MCD  search  a  space  that  increases 
exponentially  with  the  sample  size  and  the  dimension.  In 
fact,  when  using  the  MCD  as  a  first  stage  in  a  two-stage 
estimator,  one  can  have  the  perverse  situation  of  being 
made  worse  off  by  having  more  data.  To  cope  with  this 
problem,  the  data  must  be  partitioned  so  that  the  search 
space  for  the  MCD  is  kept  in  a  reasonable  range.  After 
some  modest  experimentation,  we  settled  on  a  cell  size 
of $\gamma = 5p$. This may possibly be too small for high dimension,
but determining the optimal value was beyond
the  scope  of  the  present  paper. 
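
The randomization and partitioning steps can be sketched as follows. The cell size $5p$ is from the text; taking $\lfloor n/5p \rfloor$ cells (with at least one) and dealing the shuffled indices round-robin are assumptions made for this illustration, as is the function name.

```python
import random

def partition_cells(n, p, seed=0):
    """Randomly order indices 0..n-1 and split them into cells of about 5p points."""
    gamma = 5 * p
    n_cells = max(1, n // gamma)   # assumed rounding rule when 5p does not divide n
    order = list(range(n))
    random.Random(seed).shuffle(order)
    # Deal indices round-robin so that cell sizes differ by at most one.
    return [order[j::n_cells] for j in range(n_cells)]

cells = partition_cells(n=200, p=10)
```

Each cell is then searched separately for the MCD, so the combinatorial search space depends on the cell size rather than on the full sample size.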


As  shown  in  Woodruff  and  Rocke  (1993b),  use  of  data 
partitioning  in  this  fashion  allows  the  acquisition  of  the 
good  root  with  high  probability  with  a  computational 
time  increasing  only  linearly  with  n  (instead  of  expo¬ 
nentially). 

3.3  Sequential  Point  Addition 

Working separately, Hadi (1992) and Atkinson (1992)
have proposed algorithms which begin with an estimate
of shape and location based on $p + 1$ points and then
select successively larger sets: the set with $k + 1$ points
consists of those points whose Mahalanobis distances
from the mean of the $k$-set, using the covariance of the
$k$-set as a metric, are smallest. Because Atkinson's method
is completely affine equivariant, we concentrate on this
rather than the method suggested by Hadi.

Atkinson’s  method  is  affine  equivariant.  He  suggests 
restarting the procedure many times with randomly selected
sets of $p + 1$ points. For each trial, sequential
addition  is  performed  and  for  each  stage  in  the  sequen¬ 
tial  addition,  the  covariance  matrix  is  calculated,  and 
the  resulting  shape  matrix  is  expanded  (or  contracted) 
so that half (or $(n + p + 1)/2$) of the points are included in
the  ellipsoid  defined  by  the  current  location  and  shape. 
The  estimate  over  all  trials  and  over  all  stages  of  each 
trial  in  which  the  scaled  shape  matrix  has  minimum  de¬ 
terminant  may  be  taken  as  the  robust  estimate  of  the 
shape  and  location  of  the  data.  Atkinson’s  algorithm  is 
a  large  improvement  over  MinVol.  In  the  remainder  of 
the  paper,  we  refer  to  this  procedure,  following  Atkinson, 
as  the  forward  algorithm,  or  Forward  for  short. 
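
A single pass of sequential point addition can be sketched as below. This is a simplified illustration, not Atkinson's full procedure: it omits the random restarts and the selection of the stage with minimum scaled determinant, and the function name, seed and test data are assumptions.

```python
import numpy as np

def forward_search(X, start_idx):
    """Grow a subset one point at a time and return the list of subsets visited.

    At each stage the mean and covariance of the current k-set define a
    metric; the next set is the k + 1 points closest in that metric.
    """
    n = X.shape[0]
    subset = np.array(sorted(start_idx))
    history = [subset]
    while len(subset) < n:
        mean = X[subset].mean(axis=0)
        cov = np.cov(X[subset], rowvar=False, bias=True)
        diff = X - mean
        d2 = np.einsum("ij,ij->i", diff @ np.linalg.inv(cov), diff)
        subset = np.sort(np.argsort(d2)[: len(subset) + 1])
        history.append(subset)
    return history

rng = np.random.default_rng(4)
X = rng.standard_normal((60, 3))
X[:15] += 25.0                              # 15 far shift outliers in rows 0..14
history = forward_search(X, start_idx=[15, 16, 17, 18])   # p + 1 good rows
set_45 = history[45 - 4]                    # the subset of size 45
```

When the search starts inside the good data and the outliers are far away, the subsets fill up with good points before any outlier is admitted, which is the behavior the hybrid algorithm relies on after the MCD stage.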

We  found  that  including  a  sequential  addition  step 
between  Tabu  search  for  the  MCD  and  the  iterative  es¬ 
timator  improved  the  results  in  some  cases.  Here  the 
preliminary  MCD  estimator  is  used  to  choose  the  p  +  1 
points  closest  to  the  location  estimate  using  the  MCD 
metric,  and  then  sequential  addition  as  used  by  Atkin¬ 
son  proceeds  once,  yielding  a  new  location  and  shape 
estimator that is then used to start the iterations for the
M- or S-estimator. The importance of including the
point  addition  sub-algorithm  is  reduced  if  the  contami¬ 
nation  is  further  away  from  the  good  data.  So,  although 
the  inclusion  of  the  point  addition  sub-algorithm  is  not 
critical,  it  seems  well  worth  the  small  effort  required  to 
code  it. 

3.4  Minimum  Covariance  Determinant 

Faced  with  a  subsample  of  contaminated  data,  our  exper¬ 
iments  indicate  that  the  best  way  to  find  a  good  starting 
point  for  sequential  point  addition  (or  for  M-iteration) 
is to search for the MCD. It was originally thought that
the MVE would be preferable for computational reasons
(see  Rousseeuw  and  Van  Zomeren  1990),  even  though 
the  MCD  has  greater  asymptotic  efficiency.  This  was 
based  on  the  notion  that  MVE  algorithms  would  make 
use  of  elemental  subsets.  Woodruff  and  Rocke  (1993a) 
demonstrated  that  heuristic  search  algorithms  that  use 
larger  subsample  sizes  perform  better.  Given  this  fact, 
there  is  no  longer  any  reason  to  prefer  the  MVE  to  the 
MCD.  Simulations  done  by  Woodruff  and  Rocke  (1993b) 
strongly  support  the  contention  that  the  MCD  is  in  fact 
the  better  estimator  to  use. 

The MCD for any set of data is defined by the half
sample whose covariance matrix has minimum determinant.
It is convenient to search for MCD half samples
by moving from half sample to half sample through the
removal of one point in the current half sample and the
addition of one not currently in. Neighborhoods defined in
this  way  can  form  the  basis  of  a  steepest  descent  to  a 
local  minimum.  Hawkins  (1993b)  suggests  the  use  of 
steepest  descent  with  random  restarts,  which  he  calls 
FSA.  Woodruff  and  Rocke  (1993b)  advocate  the  use  of  a 
steepest  descent  based  meta-heuristic  called  tabu  search 
(Glover  1989,  1990).  A  tabu  search  (TS)  algorithm  for 
the  MCD  is  given  in  Rocke  and  Woodruff  (1994). 
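
The swap neighborhood can be illustrated with a plain steepest descent. The sketch below is neither the FSA of Hawkins nor the tabu search of Woodruff and Rocke: it performs a single descent from one starting half sample, without restarts or tabu lists, and it recomputes each determinant from scratch for clarity rather than speed. Names, seed and data are illustrative.

```python
import numpy as np

def cov_det(X, idx):
    """Determinant of the covariance matrix of the rows listed in idx."""
    return float(np.linalg.det(np.cov(X[idx], rowvar=False, bias=True)))

def mcd_steepest_descent(X, start_idx):
    """Swap-neighborhood steepest descent for the MCD criterion.

    Each move removes one point from the half sample and adds one from
    outside it; the best improving swap is taken until none exists.
    """
    n = X.shape[0]
    inside = [int(i) for i in start_idx]
    best = cov_det(X, inside)
    while True:
        outside = [i for i in range(n) if i not in inside]
        candidate, candidate_det = None, best
        for pos in range(len(inside)):
            for new in outside:
                trial = inside[:pos] + [new] + inside[pos + 1:]
                det = cov_det(X, trial)
                if det < candidate_det:
                    candidate, candidate_det = trial, det
        if candidate is None:
            return inside, best            # local minimum reached
        inside, best = candidate, candidate_det

rng = np.random.default_rng(5)
X = rng.standard_normal((20, 2))
X[:5] += 8.0                               # 5 shift outliers
h = (20 + 2 + 1) // 2                      # half-sample size, here 11
start = rng.permutation(20)[:h]
start_det = cov_det(X, [int(i) for i in start])
half_sample, final_det = mcd_steepest_descent(X, start)
```

A descent of this kind stops at the first local minimum, which is the motivation for adding random restarts or a meta-heuristic such as tabu search.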

3.5  A  Comparison  of  Algorithms 

Given  that  some  runs  in  high  dimension  may  take  up  to 
an  hour  of  CPU  time,  and  that  there  are  many  conditions 
under  which  one  should  compare  estimators,  a  compre¬ 
hensive  Monte  Carlo  study  is  impractical.  In  this  section, 
we  compare  our  algorithm  with  random  search  over  el¬ 
emental  subsets  (Rousseeuw  1985:  MinVol).  Compar¬ 
isons  with  the  forward  algorithm  (Atkinson  1992:  For¬ 
ward),  steepest  descent  with  random  restarts  (Hawkins 
1993b:  FSA),  and  SURREAL  may  be  found  in  Rocke  and 
Woodruff  (1994).  The  hybrid  algorithm  proved  to  be  su¬ 
perior  to  these  methods,  as  well  as  to  MinVol,  for  high 
dimension  or  large  data  sets. 

The  good  data  in  the  simulation  are  multivariate  stan¬ 
dard  normal;  the  bad  data  are  multivariate  normal  with 
covariance  I  but  with  a  mean  displaced  a  distance  of 
$dQ_p$, where values $d = 2$ and $d = 4$ were used. The
dimension $p$ was 10, 20, and 50, with sample sizes of
$n = 5p$, $n = 10p$, and $n = 20p$. Several processing times
t  were  tried  for  each  case,  varying  from  a  few  seconds  to 
several  hours  in  high  dimensional  examples.  The  degree 
of  contamination  a  was  varied  from  levels  where  the  so¬ 
lution  could  almost  always  be  found  by  most  methods  to 
levels  where  none  of  the  methods  could  get  them  right. 

Table 2: Fitted Performance Measures for the Hybrid
Algorithm vs. MinVol in Dimension 10. The columns
headed “%” are the predicted percentage of trials in which
the given estimator correctly identifies the outliers.

  α      n    time (sec)   Hybrid %   MinVol %
 .1    100       100         100.0       69.2
 .1    200       400         100.0       88.9
 .2    100       100          99.8       11.0
 .2    200       400         100.0       15.9
 .3    100       100          83.2        0.7
 .3    200       400          97.9        0.4

In order to increase the utility of the number of
runs that were practical to perform, a generalized linear
model was fit to the outcomes of the experiments,
each of which consisted of 20 trials per case. The logit
of  the  probability  that  a  given  estimator  would  succeed 
in  identifying  all  the  outliers  was  taken  to  be  a  linear 
function of $n$, $\alpha$, and $\log(t)$ and their interactions (nonsignificant
interactions were removed). Different models
were  fit  for  each  estimator,  distance  of  outliers,  and  for 
each  dimension  examined. 
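
One way to write the model just described is sketched below. The coefficient names $\beta_k$ and the symbols $\pi$ and $\ell$ are notation introduced here for illustration; no fitted values are implied, and the retained interaction terms are not listed in the text.

```latex
\ell = \beta_0 + \beta_1 n + \beta_2 \alpha + \beta_3 \log t
       + (\text{retained interaction terms}),
\qquad
\log\frac{\pi}{1 - \pi} = \ell,
\qquad
\pi = \frac{1}{1 + e^{-\ell}} .
```

Here $\pi$ is the probability that the estimator identifies all the outliers, so a fitted percentage is obtained as $100\pi$ at the chosen values of $n$, $\alpha$ and $t$.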

Table  2  shows  the  fitted  probability  of  success  for 
some  choices  of  the  amount  a  of  contamination,  the 
number  of  data  points,  and  the  estimation  time  for  the 
hybrid  algorithm  and  MinVol  in  dimension  10.  The 
clear  superiority  of  the  hybrid  algorithm  is  apparent.  In 
higher dimension, limited trials suggest that the hybrid
algorithm is even more dominant. However, given
the  finite  time  available  for  computer  simulations  in  high 
dimension,  most  of  the  runs  were  devoted  to  determining 
the  envelope  of  feasible  solution  for  the  hybrid  algorithm, 
rather  than  to  documenting  the  exact  degree  of  superi¬ 
ority  over  competing  algorithms. 

4  Estimating  the  Envelope 

This  section  is  devoted  to  the  following  question:  for 
what  dimensions,  sample  sizes,  outlier  distances,  frac¬ 
tions  of  outliers,  and  computation  times  is  the  hybrid 
algorithm  effective?  The  theoretical  results  in  Woodruff 
and  Rocke  (1993b)  demonstrate  that  any  amount  of  con¬ 
tamination  less  than  50%  can  theoretically  be  handled 
with  sufficient  data  and  sufficient  processing  time.  Here 
we  ask  a  different  question:  what  amount  of  contamina¬ 
tion  can  be  practically  detected  with  an  amount  of  data 
that is given and with practical processing times?

Table 3 shows some results. For each indicated combination
of dimension and outlier distance, a generalized
linear model was fit as described above. Then the level of
contamination was found that allowed a predicted 90% of
the data sets to be successfully completed. To avoid undue
extrapolation, computation times and sample sizes
were set within the bounds of what was used for
problems of that nature in our study.

Table 3: Critical Contamination Level for 90% success
with the Hybrid Algorithm. The column headed α is the
amount of contamination such that the hybrid algorithm
is predicted to be able to identify the outliers correctly
in 90% of the instances.

 p    d     n    time (sec)     α
10    2     50       200       0.27
10    2    100       200       0.29
10    2    200       200       0.32
10    4     50       200       0.29
10    4    100       200       0.32
10    4    200       200       0.36
20    2    100       800       0.21
20    2    200       800       0.24
20    2    400       800       0.27
20    4    100       800       0.24
20    4    200       800       0.25
20    4    400       800       0.28
50    2    200      5000       0.15
50    2    400      5000       0.16
50    2    800      5000       0.17
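The step from a fitted model to a critical contamination level can be sketched as follows. This is a minimal illustration that assumes the generalized linear model is a logistic regression in the contamination fraction alone; the link function and the coefficients `b0` and `b1` are hypothetical and are not taken from the study.

```python
import math


def success_probability(alpha, intercept, slope):
    """Predicted success probability at contamination fraction alpha (logit link assumed)."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * alpha)))


def critical_contamination(intercept, slope, target=0.90):
    """Contamination fraction at which the predicted success rate equals target."""
    logit_target = math.log(target / (1.0 - target))
    return (logit_target - intercept) / slope


# Hypothetical coefficients, chosen only for illustration.
b0, b1 = 5.0, -10.0
alpha_crit = critical_contamination(b0, b1)
print(round(alpha_crit, 3))
```

Inverting the linear predictor in closed form avoids a numerical root search; with additional covariates (sample size, time), those terms would be folded into the intercept before inverting.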

The  more  data  (and  the  more  computation  time), 
the  greater  the  fraction  of  outliers  that  can  be  handled. 
Within  our  self-imposed  bounds,  we  can  say  that  outlier 
fractions  in  the  30-35%  range  can  be  reliably  solved  in  di¬ 
mension  10,  with  20-30%  in  dimension  20  and  15-20%  in 
dimension 50. Although these bounds are crude, they do
give some feel for what problems are feasible. It is likely
that the sample sizes and processing times for dimension
50 are actually much too small. For assured success with
high contamination, substantially larger values of both
than the ones we used may very well be necessary.

A  point  that  should  not  be  overlooked  is  that  advances 
in  processor  technology  and  parallel  processing  can  have 
an  important  effect.  For  example,  a  DEC  3000/400  Al¬ 
pha  AXP  workstation  is  about  6  times  faster  than  the 
DECStation  5000/200  on  which  these  simulations  were 
conducted,  and  multiple  processor  machines  could  also 
be  used  to  multiply  the  effectiveness  of  the  algorithm, 
which  is  parallelizable  in  a  number  of  ways  (Woodruff 
and  Rocke  1993a). 


5  Conclusions 

In  this  paper,  we  have  investigated  the  nature  of  multi¬ 
variate  outliers  and  methods  for  their  detection.  We  have 
shown  that  shift  outliers  provide  a  reasonable  testbed  for 
multivariate  outlier  detection,  being  difficult  but  not  im¬ 
possible  to  detect.  Using  this  testbed,  we  have  shown  a 
new  hybrid  algorithm  to  be  superior  to  existing  methods 
for  this  problem.  Given  sufficient  data  and  processing 
time,  even  heavily  contaminated  data  in  high  dimension 
can  be  dealt  with. 
Abstract

There  is  a  vast  literature  aimed  at  estimating  the 
number  of  species  in  a  diverse  population  and  the 
frequency  distribution  of  individuals  across  species. 
Regardless of the model or estimating method chosen,
estimates of the “true number” of species typically
have a very large variance. A new model presented
here separates the observed class distribution
into “well-defined” and “residual” classes with different
generating distributions.

1  Introduction 

The  detection  of  abnormal  or  “novel”  behavior  in  dy¬ 
namic  mechanical  systems  is  an  emerging  area  of  crit¬ 
ical  importance  in  the  aerospace  industry.  Sensors  of 
acoustic  or  mechanical  vibrations,  optical  patterns, 
etc.  can  be  deployed  in  critical  locations  in  an  air¬ 
craft  or  space  vehicle,  and  the  resulting  time  series 
or  space-time  series  can  be  monitored  for  unusual 
changes  in  a  variety  of  ways.  This  complex  data  is 
typically  generated  at  a  very  high  rate,  in  very  large 
volume  or  both,  and  real-time  processing  is  desirable 
in  some  of  the  applications  envisioned.  Furthermore, 
the  dynamics  of  the  vehicle  are  too  complex  for  a 
model  based  on  physical  understanding  to  be  prac¬ 
tical.  In  this  environment,  engineers  are  seeking  to 
use  neural  networks  to  reduce  and  analyze  the  data 
generated  by  the  sensors. 

In order to be effective, neural networks generally
require data that has been carefully preprocessed
(transformed). Postprocessing of the output of the
neural network, to evaluate and interpret its “analysis,”
is also commonly necessary. The composite
species distribution model, which is the focus of this
article, was motivated by the need to develop a sensitive
statistical performance monitor to postprocess
the output of an ART1 neural network (Carpenter
and Grossberg [2]) used in the “novelty detection”
environment described above. However, this model may
be used with any pattern recognition algorithm, and
in fact with any classification system. The composite
model may also prove valuable in ecology and genetics,
and language suggestive of these applications is
used whenever appropriate.

2 Background: Neural Network Novelty Detection

In the typical novelty detection application, the
ART1 network is presented with long sequences of
patterns generated from relatively short subsamples
(windows) of the time series in question. The choice
of window sizes, window overlap, and preprocessing
is highly application-dependent. ART1 creates
a binary-coded classification of these incoming patterns
“on the fly” by creating internal representations
(templates) of new patterns as they are presented.
New patterns may classify to an existing template if
they are close enough according to a Hamming metric
which is highly dependent on the preprocessing
selected. Previously generated templates can be modified
to some degree by subsequent patterns. Patterns
which do not match any existing template generate a
new template.
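The on-the-fly classification described above can be illustrated with a deliberately simplified sketch. It is not ART1: templates are never modified after creation, the vigilance rule is replaced by a plain Hamming-distance threshold, and the function names and the `max_distance` parameter are hypothetical. It only shows how activation counts and new-template events arise from a stream of binary patterns.

```python
def hamming(a, b):
    """Number of positions at which two equal-length binary patterns differ."""
    return sum(x != y for x, y in zip(a, b))


def classify_stream(patterns, max_distance):
    """Assign each pattern to the nearest stored template, or create a new one."""
    templates = []         # stored binary templates, in order of creation
    activations = []       # activation count per template
    new_template_at = []   # stream positions at which a template was created
    for idx, pattern in enumerate(patterns):
        best = None
        for t_idx, template in enumerate(templates):
            d = hamming(pattern, template)
            if d <= max_distance and (best is None or d < best[0]):
                best = (d, t_idx)
        if best is None:
            # No template is close enough: the pattern becomes a new template.
            templates.append(tuple(pattern))
            activations.append(1)
            new_template_at.append(idx)
        else:
            activations[best[1]] += 1
    return templates, activations, new_template_at


stream = [(0, 0, 0, 0), (0, 0, 0, 1), (1, 1, 1, 1),
          (1, 1, 1, 0), (0, 0, 0, 0), (1, 0, 1, 0)]
templates, activations, created = classify_stream(stream, max_distance=1)
print(activations, created)
```

The list of creation positions is what a monitor would watch for a sudden burst of new templates, while the activation counts form the frequency vector analyzed in the next section.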

When  the  rate  of  generation  of  new  templates  slows 
down  or  stops,  the  network  is  “trained”  to  recognize 
the  normal  behavior  of  the  suite  of  sensors  it  is  mon¬ 
itoring.  Abnormal  or  novel  behavior  may  be  indi¬ 
cated  by  a  sudden  increase  in  the  generation  of  new 
templates,  but  in  some  cases  a  more  sensitive  indica¬ 
tor  may  be  needed.  In  many  applications,  the  rate 
of  activation  of  at  least  some  of  the  more  “popular” 
templates  encoded  during  learning  appears  to  be  rel¬ 
atively  stable.  The  qualitative  use  of  pattern  activa¬ 
tion  data  is  discussed  and  illustrated  in  the  novelty 
detection  context  in  Newman  and  Caudell  [11]. 

The ART1 network itself will not be discussed further
in this article, and the reader is referred to [11]
and [2] for the specifics of ART1 and to surveys of the
general neural network literature such as Grossberg
[6] or Hecht-Nielsen [7] for further information.

3  Motivation:  Estimating  the 
Number  of  Species 

The basic data used to monitor the neural network
after N patterns have been processed is the vector of
activation frequencies (or distribution of observed individuals
among species or classes) $n = (n_1, \ldots, n_c)$,
where c is the number of classes which have actually
been observed (or templates which have been
created). It is reasonable to represent the data as a
sample from a multinomial distribution with a fixed
but unknown number of classes C. But now C is
a parameter to be estimated, and standard methods
for estimating the class probabilities $p = (p_1, \ldots, p_C)$
and testing hypotheses based on the multinomial assumption
with C known a priori are no longer adequate.

Bunge and Fitzpatrick [1] review the literature
on estimation of the “true” number C of species or
classes from a sample of size N. Direct estimation of
C is frequently difficult, even when unrealistic simplifying
assumptions are made, causing many authors
to base estimates of C on estimates of the coverage
u of the sample. Coverage is the true proportion of
the population represented in the sample; formally,
$u = \sum_{i=1}^{c} p_i$, where for notational simplicity it
is assumed that the indices of the observed classes
$i = 1, \ldots, c$ are consistent with the indexing of p, and
c < C.

Estimates of u

   N      MLE           Good          Starr    NPMLE
  100     [illegible]   [illegible]   .012     .015
 1000     1.00          [illegible]   .004     .006
 7440     .9999         [illegible]   .0007    .0024

Estimates of C

  c     MLE                          Good     Starr    Chao & Lee
  5     5      (222)                 5          423    5      [0]
 29     29     (12868)               29        7217    29     [0]
 86     86.01  (7.5 x 10^[illegible])  86.01   12797   86.01  [0]

Table 1: Estimates of coverage u and true number of
classes C using several methods, for various values of
sample size N and corresponding numbers c of classes
observed
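As a small illustration of the coverage definition, the sketch below computes the true coverage for a population whose class probabilities are known, alongside a sample-based estimate. The estimator shown is the familiar Good-Turing form, one minus the fraction of the sample falling in singleton classes; treating that as the "Good" method of the text is an assumption, and the probabilities and sample are hypothetical.

```python
from collections import Counter


def true_coverage(class_probs, observed_classes):
    """Sum of the population probabilities of the classes seen in the sample."""
    return sum(class_probs[k] for k in observed_classes)


def good_coverage_estimate(counts):
    """Good-Turing coverage estimate: 1 - (number of singleton classes) / N."""
    n_total = sum(counts)
    singletons = sum(1 for n in counts if n == 1)
    return 1.0 - singletons / n_total


# Hypothetical population with C = 4 classes and a sample of N = 10.
p = {"a": 0.50, "b": 0.30, "c": 0.15, "d": 0.05}
sample = ["a", "a", "b", "a", "c", "b", "a", "b", "a", "a"]
counts = Counter(sample)

u = true_coverage(p, counts.keys())
u_hat = good_coverage_estimate(list(counts.values()))
print(u, u_hat)
```

Only c = 3 of the C = 4 classes appear in this sample, so the true coverage falls short of one by exactly the probability of the unseen class.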

Experiments  have  been  conducted  with  neural  net¬ 
work  output  using  several  of  the  multinomial,  infinite 
population  methods  cited  in  [1].  Table  1  shows  some 
typical  results.  Numbers  in  parentheses  under  the 
point  estimates  are  estimates  of  standard  deviations. 
Chao and Lee [3] gave estimators for the coefficient
of variation of their estimator of C; these are shown in
brackets.

The maximum likelihood estimate (MLE), Good
and Chao estimates of u and C, and Good’s estimate
of the standard deviation of u indicate that there
are no new classes to be discovered, while the Starr
and non-parametric MLE (NPMLE) estimates and the MLE
standard deviation of C indicate that only a tiny fraction of
the classes have been discovered. The lack of agreement
among the methods, and the inability of any
of the estimates based on (N = 100, c = 5) or
(N = 1000, c = 29) to predict future behavior, undermine
the credibility of the fixed-but-unknown C
multinomial model. It seems essential to regard C as
a random variable.

Returning to the review by Bunge and Fitzpatrick
[1], methods which approximate the histogram of p by
some kind of parametric model appear to give better
results in practice than those which treat p as the
primary vector of parameters, according to the authors.
The results of Keener, Rothman and Starr [9]
are especially pertinent in light of the considerations
above. Following a number of previous investigators,
they place a symmetric Dirichlet distribution

\[ \mathrm{Prob}(p \mid \alpha, C) = \frac{\Gamma(C\alpha)}{\Gamma(\alpha)^{C}} \prod_{i=1}^{C} p_i^{\alpha - 1} \tag{1} \]

on p, although they still treat the number of classes
C as a fixed parameter; their approach is empirical
Bayesian. They consider estimation of C with
$\alpha$ known, and also with $\alpha$ unknown, the latter being
the more realistic assumption. (Keener et al. [9] use m
instead of C, and A instead of $\alpha$.) They tested their
methods with numerous examples, and found that
quite often one obtained estimates in which $\alpha \to \infty$
or $C \to \infty$ while $\alpha C$ approached a finite limit,
here called $\theta$. In particular
they show that in the more common case in which
$C \to \infty$ while $\alpha \to 0$ but $\alpha C \to \theta$, the limiting
distribution of c is given by the Ewens [5]
sampling distribution

\[ \mathrm{Prob}(c \mid \theta) = \frac{\theta^{c}}{(\theta)_N} \left| S_N^{(c)} \right| \tag{2} \]

where $(\theta)_N = \theta(\theta + 1) \cdots (\theta + N - 1)$ and $S_N^{(c)}$ is
the Stirling number of the first kind. Furthermore,
the total number of classes c observed is a sufficient
statistic for $\theta$. Further support for the use of the Ewens
distribution may be found in Table 1: note the behavior
of Starr’s [12] estimate and the nonparametric
maximum likelihood estimate [4], which suggest that
large values of C are possible.
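The Ewens distribution of the number of observed classes is easy to evaluate numerically for small N. The sketch below is a minimal illustration, taking the normalizing term to be the rising factorial θ(θ + 1)···(θ + N − 1) and using the standard recurrence for unsigned Stirling numbers of the first kind; the function names and the value of θ are hypothetical.

```python
def unsigned_stirling_first(n):
    """Row n of unsigned Stirling numbers of the first kind; result[k] = |S_n^(k)|."""
    row = [1]  # n = 0: a single entry for k = 0
    for m in range(1, n + 1):
        new = [0] * (m + 1)
        for k in range(1, m + 1):
            # Recurrence: |s(m, k)| = (m - 1) |s(m - 1, k)| + |s(m - 1, k - 1)|
            prev_same = row[k] if k < m else 0
            new[k] = (m - 1) * prev_same + row[k - 1]
        row = new
    return row


def rising_factorial(theta, n):
    """theta (theta + 1) ... (theta + n - 1)."""
    result = 1.0
    for j in range(n):
        result *= theta + j
    return result


def ewens_class_count_pmf(n, theta):
    """Probabilities of observing c = 0, 1, ..., n classes in a sample of size n."""
    stirling = unsigned_stirling_first(n)
    denom = rising_factorial(theta, n)
    return [stirling[c] * theta ** c / denom for c in range(n + 1)]


pmf = ewens_class_count_pmf(4, 1.5)
print([round(x, 4) for x in pmf])
```

Because the unsigned Stirling numbers are the coefficients of the rising factorial as a polynomial in θ, the probabilities sum to one by construction, which gives a convenient check on any implementation.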

Informal  statistical  analyses,  including  graphical 
examination  of  the  template  activation  pattern,  and 
numerical  examination  of  changes  in  the  “Pareto  his¬ 
togram”  of  pattern  activations  as  N  increases,  suggest 
that a subset of the classes (in particular the most
frequently occurring classes) behave like a standard
multinomial  distribution,  while  the  rate  of  activation 
of  many  templates  is  rather  erratic.  While  the  rate 
of  addition  of  new  templates  generally  slows  down,  it 
does  not  always  cease,  and  will  suddenly  increase  if 
some  “new”  phenomenon  occurs  in  the  data. 

The  population  biology  analogue  of  this  phe¬ 
nomenon  would  occur  when  some  species  are  dom¬ 
inant  (not  necessarily  in  numbers,  as  in  the  neural 
network  application),  while  many  species  are  in  some 
kind  of  competitive  equilibrium.  In  genetics,  highly 
selected  alleles  may  follow  one  pattern,  while  neutral 
alleles  (as  in  Ewens  [5])  follow  another. 

4 The Composite Species Distribution Model

These observations motivated the model presented
here, which is a compromise between a “full multinomial”
model and a limiting Ewens distribution. It
is presented in Bayesian terms, although clearly an
empirical Bayes interpretation is possible. The total
number of observed classes c is divided conceptually
into w “well-defined” and r “residual” classes. As a
consequence, the total number of patterns N is partitioned
by the choice of w into $N = N_w + N_r$, the
number of well-defined and residual patterns, respectively.
A Dirichlet prior distribution is imposed on
the multinomial probabilities $q = (q_1, \ldots, q_{w+1})$
associated with the w + 1 classes formed by lumping
all the residual classes together into a single class, and
appending it to the well-defined classes.
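The partition induced by a choice of w can be sketched as follows. This is an illustration only: it assumes that the well-defined classes are the w most frequently activated ones, which the text suggests but does not state as a rule, and the counts are hypothetical.

```python
def partition_classes(counts, w):
    """Split class counts into w well-defined classes and a lumped residual class."""
    ordered = sorted(counts, reverse=True)
    well_defined = ordered[:w]   # assumed: the w most frequent classes
    residual = ordered[w:]
    n_w = sum(well_defined)
    n_r = sum(residual)
    return {
        "N_w": n_w,                              # patterns in well-defined classes
        "N_r": n_r,                              # patterns in residual classes
        "r": len(residual),                      # number of residual classes
        "lumped_counts": well_defined + [n_r],   # the w + 1 multinomial cells
    }


counts = [40, 25, 20, 5, 4, 3, 2, 1]   # hypothetical activation frequencies
split = partition_classes(counts, w=3)
print(split)
```

The `lumped_counts` vector is the data for the (w + 1)-cell multinomial part of the model, while `r` and `N_r` are the inputs to the Ewens part.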

The conditional distribution of r given w is then
given by the Ewens distribution (2), with r replacing
c, and $N_r$, the number of patterns classified to
residual classes, replacing N, in (2). An approximate
natural conjugate prior on $\theta$ has been adopted:

\[ \mathrm{Prob}(\theta \mid r_0, n_0) \propto [\text{right-hand side illegible in source}] \tag{3} \]

The hyperparameters $r_0$ and $n_0$ can be any small
numbers ($r_0 < n_0$); the details of this prior are overwhelmed
by the data in the examples which have been
investigated.

One  of  three  “reference”  or  “informative”  prior  dis¬ 
tributions  Prob{w)  is  imposed  on  w.  Following  the 
example  of  York  and  Madigan  [13],  the  parameter  w 
is  regarded  as  defining  the  class  of  models  of  inter¬ 
est.  The  three  priors  are  Jeffreys,  Rissanen  and  the 
negative  binomial;  see  [13]  for  details. 
Finally, the choice of w is not treated as simply
a matter of parameter estimation, but rather one of
model uncertainty ([8], [10]). The “Occam’s window”
approach of Madigan and Raftery [10] is used to restrict
the number of values of w which are considered
reasonable alternatives, and a posterior distribution
over w is computed on this restricted set. The rationale
is that it is important to know whether or not the
number of well-defined classes is itself well-defined by
the data; the answer is expected to be highly
application specific.
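The restriction step can be sketched in a few lines. The example below implements only the basic Occam's window rule, discarding any value of w whose posterior probability is smaller than that of the best value by more than a fixed ratio, and then renormalizing. The ratio of 20 and the log posterior values are hypothetical, and the additional pruning rules of Madigan and Raftery are not reproduced.

```python
import math


def occams_window(log_post, ratio=20.0):
    """Keep models within a factor `ratio` of the best one and renormalize."""
    best = max(log_post.values())
    cutoff = best - math.log(ratio)
    kept = {w: lp for w, lp in log_post.items() if lp >= cutoff}
    norm = sum(math.exp(lp - best) for lp in kept.values())
    return {w: math.exp(lp - best) / norm for w, lp in kept.items()}


# Hypothetical unnormalized log posterior values for candidate w.
log_post = {2: -12.0, 3: -8.5, 4: -8.0, 5: -9.0, 6: -14.0}
posterior = occams_window(log_post)
print({w: round(pr, 3) for w, pr in posterior.items()})
```

A posterior spread over several neighboring values of w, as here, would indicate that the number of well-defined classes is not sharply determined by the data.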
The site at which a nascent polypeptide is cleaved
and a glycosyl-phosphatidylinositol (GPI) anchor is attached
is an important and difficult issue in biochemical
research. Recently, two alternative methods to determine
the locus of cleavage ($\omega$) have been proposed. One
is based on elegant experimental biochemical and molecular
studies and assigns a priori probabilities to possible
cleavage sites. The other method uses data from the
amino  acid  frequency  analysis  and  the  nascent  polypep¬ 
tide  amino  acid  sequence  to  predict  the  locus  of  cleav¬ 
age  using  a  Chi-square  statistic,  but  ignores  prior  in¬ 
formation  from  the  biochemical  studies.  In  this  paper, 
we  propose  a  Bayesian  approach  for  this  inference  which 
synthesizes  both  methods.  This  allows  probability  state¬ 
ments  regarding  predictions  of  cleavage  sites  to  be  made. 
S  code  for  such  analyses  is  described  and  an  example  il¬ 
lustrating  the  impact  of  different  levels  of  prior  informa¬ 
tion  is  provided. 

1.  INTRODUCTION 

Although  glycosyl-phosphatidylinositol  (GPI)  anchor¬ 
ing  is  only  a  recently  recognized  form  of  attachment  of 
proteins  to  cell  membranes,  over  50  such  proteins  have 
been  identified.  The  synthesis  of  proteins  destined  to 
be  GPI  anchored  is  a  complex  process.  (Biological  sys¬ 
tems  create,  transport  and  utilize  many  different  types 
of  proteins  which  can  be  thought  of  as  strings  of  amino 
acids.)  The  newly  translated,  or  nascent,  full  length 
polypeptide  is  further  processed  by  removal  of  portions 
of  the  amino-terminal  domain  and,  possibly,  addition  of 
carbohydrates  or  lipids.  Nascent  polypeptides  destined 
for  GPI  anchoring  contain  features  in  their  carboxyl- 
terminal  amino  acid  sequence  which  are  recognized  by 
a  specialized  enzyme  called  GPI-transamidase.  GPI- 
transamidase  changes  the  polypeptide  in  two  ways:  (1) 
it  endoproteolytically  cleaves  a  hydrophobic  amino  acid- 
rich  portion  of  the  carboxyl  end,  and  (2)  it  attaches 
a  lipid-rich  preformed  GPI  anchor.  The  mature  GPI 
protein  is  therefore  always  a  truncated  version  of  the 
nascent  polypeptide.  These  final  processing  steps  give 
the  mature  protein  its  characteristics  of  attachment  to 
cell  membranes  and  probably  play  a  role  in  its  function. 

Perturbation of the normal mechanism of GPI anchoring
of important proteins synthesized in hematopoietic
cells is the basis for a severe life threatening hematological
disease known as paroxysmal nocturnal hemoglobinuria.
Thus, for this as well as other GPI anchored proteins,
the effects of altering the natural locus of GPI attachment
are of particular interest to the researcher, but
this locus must be determined first. By the time such
research  is  contemplated,  the  cDNA  deduced  amino  acid 
sequence  of  the  nascent  polypeptide,  as  well  as  the  amino 
acid  analysis  of  the  mature  GPI  anchored  protein,  are  of¬ 
ten  available.  Unfortunately,  the  locus  at  which  the  GPI 
transamidase  cleaves  the  nascent  polypeptide  is  not  ob¬ 
vious  from  this  information  alone.  Determining  the  locus 
of  cleavage  is  labor  intensive  from  the  biochemical  point 
of  view,  often  involving  several  months  of  painstaking 
and  expensive  analysis. 

Recently,  however,  two  alternative  methods  to  deter¬ 
mine  the  locus  of  cleavage  have  been  proposed.  Antony 
and  Miller  (1994)  used  data  from  the  amino  acid  fre¬ 
quency  analysis  and  the  nascent  polypeptide  amino  acid 
sequence  to  predict  the  locus  of  cleavage  using  a  Chi- 
square  statistic.  The  other  method,  based  on  elegant 
biochemical  and  molecular  studies  by  Kodukula,  Gerber, 
Amthauer,  Brink  and  Udenfriend  (1993),  assigns  a  pri¬ 
ori  probabilities  to  possible  cleavage  sites.  We  propose 
a  Bayesian  approach  for  this  inference  which  synthesizes 
both  methods.  The  intention  of  the  present  research  was 
to  combine  these  two  methods  and  allow  for  the  inclusion 
of  replicate  amino  acid  frequency  data  if  available.  This 
paper  describes  the  S  software  written  to  implement  this 
new  method. 

2.  LIKELIHOOD  AND  PRIOR 

During  an  amino  acid  frequency  analysis  of  the  ma¬ 
ture  protein,  bonds  between  the  amino  acids  are  bro¬ 
ken  and  the  relative  frequency  of  acids  present  is  de¬ 
termined.  Adding  a  standard  to  the  sample  provides  a 
measure  of  scale  so  that  the  total  number  of  amino  acids 
in  each  molecule  can  be  estimated  (see  Table  1).  The 
process  is  not,  however,  without  error.  It  can  destroy 
the  amino  acid  tryptophan.  It  may  also  fail  to  distin¬ 
guish  asparagine  from  aspartic  acid,  or  glutamine  from 
glutamic  acid.  As  a  result,  the  frequencies  are  not  per¬ 
fectly  determined:  systematic  as  well  as  random  errors 
occur. 
Table 1: Results from an amino acid frequency analysis.

Amino Acid    Frequency    Proportion
A                 1.9         0.02
D or N            6.3         0.07
F                 0.0         0.00
G                 9.1         0.10
I                 0.1         0.00
K                 5.4         0.06
L                 1.0         0.01
P                32.8         0.37
Q or E           26.9         0.30
S                 1.0         0.01
T                 3.8         0.04
W                  ?            ?
V                 1.0         0.01
Total            89.1         1.00

When the sequence of the complete protein is known,
the amino acid frequency analysis can be used to make
inferences about the cleavage point. Antony and Miller
(1994) described “computerized exoproteolytic cleavage”,
which computes and recomputes the frequency of
amino acids from the known complete sequence after
deleting one acid at a time from the carboxyl-terminal end.
Inference for the cleavage point can then be done
by comparing each theoretical frequency analysis to the
observed amino acid analysis using a Chi-square statistic.
This method enjoys a high degree of success for prediction;
however, the Chi-square approach is not as flexible as a
likelihood analysis and is not well suited for the inclusion
of prior information.

Information from the amino acid analysis can be
thought of in two parts: (1) the proportions of each
amino acid and (2) the total number of amino acids. We
will assume these two are independent. Let K be the
number of different amino acids in the nascent polypeptide.
Denote the proportions from the amino acid analysis
by $p = (p_1, \ldots, p_K)$, and the total number by T. Denote
the theoretical frequencies by $n_\omega = (n_{\omega 1}, \ldots, n_{\omega K})$,
and the sum of the theoretical frequencies by $N_\omega = \sum_{i=1}^{K} n_{\omega i}$,
where $\omega = 1, \ldots, N$ is a possible cleavage locus. Then

\[ L(\omega; p, T) = L(n_\omega; p) \times L(N_\omega; T). \]

For observed proportions, we assume a Dirichlet distribution
parameterized as

\[ L(n_\omega; p) = B^{-1}(n_\omega + 1) \prod_{i=1}^{K} p_i^{n_{\omega i}}, \]

where B is the generalized beta function. This implies
$E(p_i) = (n_{\omega i} + 1)/(N_\omega + K)$ instead of $E(p_i) = n_{\omega i}/N_\omega$,
which shrinks the observed proportions towards 1/K.
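The Dirichlet term can be evaluated on the log scale to avoid overflow in the generalized beta function. The sketch below is a minimal illustration in Python rather than S; the function names are hypothetical, and the convention that a zero observed proportion paired with a zero theoretical count contributes a factor of one is an assumption about how such cells should be handled.

```python
import math


def log_dirichlet_likelihood(p, n):
    """log of prod(p_i ** n_i) / B(n + 1) for proportions p and theoretical counts n."""
    if len(p) != len(n):
        raise ValueError("p and n must have the same length")
    # log B(n + 1) = sum(lgamma(n_i + 1)) - lgamma(sum(n_i) + K)
    log_beta = sum(math.lgamma(c + 1) for c in n) - math.lgamma(sum(n) + len(n))
    log_kernel = 0.0
    for q, c in zip(p, n):
        if c == 0:
            continue                 # p ** 0 = 1, even when p is 0
        if q == 0:
            return float("-inf")     # positive count but zero observed proportion
        log_kernel += c * math.log(q)
    return log_kernel - log_beta


def shrunken_mean(n):
    """Expected proportions (n_i + 1) / (N + K) implied by the parameterization."""
    total, k = sum(n), len(n)
    return [(c + 1) / (total + k) for c in n]


print(math.exp(log_dirichlet_likelihood([0.5, 0.5], [1, 1])))
print(shrunken_mean([3, 0, 1]))
```

The second function makes the shrinkage visible: a class with theoretical count zero still receives expected proportion 1/(N + K) rather than zero.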
For the total, a good model is harder to specify. One
choice is

\[ L(N_\omega; T) = \frac{N_\omega^{T-1}}{\Gamma(T)} \exp(-N_\omega), \]

which is a gamma distribution with mean T and variance
T; that is, shape parameter T and scale parameter set
to 1. If the observed frequencies are all gamma with
scale parameter 1, then the proportions are Dirichlet and
the total is gamma with scale parameter 1. We do not
consider direct modeling of the variability in observed
frequencies, since replicate amino acid analyses are not
typically available. If such data were available, it would
be straightforward to estimate the scale parameter in the
gamma distribution.
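A short sketch of the term for the total follows. It evaluates a gamma density with shape T and scale 1 at the candidate length N_ω, which is how the formula reads in the text; the function name and the candidate range are hypothetical, and T = 89.1 is the observed total from Table 1.

```python
import math


def log_gamma_likelihood(n_omega, total):
    """log of N^(T - 1) exp(-N) / Gamma(T): gamma density, shape T, scale 1, at N."""
    return (total - 1.0) * math.log(n_omega) - n_omega - math.lgamma(total)


T = 89.1  # observed total number of amino acids (Table 1)

# Candidate lengths of the mature protein, as a hypothetical range.
candidates = range(60, 117)
best = max(candidates, key=lambda n: log_gamma_likelihood(n, T))
print(best)
```

With the scale fixed at 1, this term alone favors lengths close to the observed total, and its spread grows with T; estimating the scale from replicate analyses would sharpen or flatten it accordingly.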

Kodukula et al. (1993) conducted empirical studies to
determine not only which amino acids could serve as
cleavage points in GPI proteins, but also which amino
acids could be adjacent to cleavage points. They presented
a table for the propensity of each amino acid to
be at the cleavage site, and one or two amino acids from
the cleavage site (towards the carboxyl-terminal). The
“probability” of a specific amino acid being the cleavage
site is the appropriate product from Table 2. For example,
if serine (S) is at a given location, and the two locations
towards the carboxyl-terminal are arginine (R) and
alanine (A), the “probability” is 1.0 x 0.5 x 1.0 = 0.5. We
can compute such products for each site in the nascent
polypeptide, and, after standardizing, treat the collection
as a prior probability for cleavage site. We denote
the prior distribution $\mathrm{pr}(\omega)$. Although this has some
weaknesses, it provides a great deal of quantitative information
about cleavage sites. Kodukula et al. suggest
that this prior alone correctly identifies the cleavage
point about 75% of the time.

The posterior probability for the cleavage site occurring
at a given point in the sequence is proportional to the
product of the likelihood and the prior distribution. That is,

\[ \mathrm{pr}(\omega \mid p, T) \propto L(\omega; p, T) \times \mathrm{pr}(\omega), \]

where $L(\omega; p, T)$ is the product of the Dirichlet and
gamma distributions described above. This posterior
can be used to make predictions for the cleavage site
and attach probability statements to those predictions.
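Combining the two ingredients into posterior probabilities amounts to multiplying and standardizing over the candidate sites. The sketch below works with log-likelihoods so that very small or very large likelihood values do not cause numerical trouble; the function name and the three-site inputs are hypothetical.

```python
import math


def posterior(log_lik, prior):
    """Posterior over sites from per-site log-likelihoods and prior probabilities."""
    log_terms = [
        ll + math.log(pr) if pr > 0 else float("-inf")
        for ll, pr in zip(log_lik, prior)
    ]
    peak = max(log_terms)
    if peak == float("-inf"):
        raise ValueError("every site has zero prior or zero likelihood")
    # Subtract the largest term before exponentiating to avoid overflow.
    weights = [math.exp(t - peak) for t in log_terms]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical values for three candidate cleavage sites.
post = posterior([-3.0, -1.0, -2.0], [0.2, 0.5, 0.3])
print([round(x, 4) for x in post])
```

A site with zero prior probability keeps zero posterior probability whatever its likelihood, which is why substituting small positive values for zero propensities can matter in practice.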

3.  IMPLEMENTATION  IN  S 

Table 2: Relative propensity as cleavage site; Table II
from Kodukula et al. (1993)

Amino Acid     ω      ω + 1    ω + 2
A             0.4      1.0      1.0
R             ND       0.5      ND
N             0.8      ND       ND
D             0.4      0.4      0.1
C             0.2      0.3      0
Q             0        0.1      ND
E             0        0.4      0
F             ND       ND       ND
G             0.4      ND       0.7
H             ND       ND       0
I             ND       ND       ND
L             0.1      ND       ND
K             0        ND       ND
M             0        0.3      ND
P             0        0        0
S             1.0      0.6      0.3
T             0        0.3      0.1
W             0        0.1      ND
Y             0        ND       ND
V             0.1      ND       0.1

We first describe how data for this problem are stored
in S and define some special functions. Procyclic acidic
repetitive protein (parp) in T. brucei will be used as a
running example. The known full sequence is stored as
a character vector, with the carboxyl-terminal first:


> parp <- rev(c("A", ..., "F"))

[The 116 one-letter amino acid codes in this listing are illegible
in the source scan. The same sequence, with D/N and Q/E merged,
is given in parp.alt below.]

The  amino  acid  analysis  is  stored  as  a  numeric  vector 
with  labels  distinguishing  the  amino  acid  types: 

parp.analysis <-
    c(1.9, 6.3, 0.0, 9.1, 0.1, 5.4,
      1.0, 32.8, 26.9, 1.0, 3.8, 1.0)
names(parp.analysis) <-
    c("A", "DN", "F", "G", "I", "K",
      "L", "P", "QE", "S", "T", "V")

The protein as it appears to the analysis is stored as a
factor object in S: levels (types of amino acid) with zero
frequency appear in output from the S function table,
and “invisible” levels (e.g. tryptophan) can be excluded.


parp.alt <- rev(factor(c(
"A","QE","G","P","QE","DN","K","G","L","T","K",
"G","G","K","G","K","G","QE","K","G","T","K",
"V","S","A","DN","DN","T","DN","G","T","DN","P",
"DN","P","QE","P","QE","P","QE","P","QE","P","QE",
"P","QE","P","QE","P","QE","P","QE","P","QE","P",
"QE","P","QE","P","QE","P","QE","P","QE","P","QE",
"P","QE","P","QE","P","QE","P","QE","P","QE","P",
"QE","P","QE","P","QE","P","QE","P","QE","P","QE",
"P","QE","P","QE","P","G","A","A","T","L","K","S",
"V","A","L","P","F","A","I","A","A","A","A","L",
"V","A","A","F"), exclude="W"))


Computerized exoproteolytic cleavage is done using
the table function. For the entire nascent polypeptide,
deletion of the first amino acid, and deletion of the first
30 amino acids:


> table(parp.alt)
  A DN  F  G  I  K  L  P QE  S  T  V
 12  6  2  9  1  7  4 33 32  2  5  3
> table(parp.alt[-seq(1)])
  A DN  F  G  I  K  L  P QE  S  T  V
 12  6  1  9  1  7  4 33 32  2  5  3
> table(parp.alt[-seq(30)])
  A DN  F  G  I  K  L  P QE  S  T  V
  2  6  0  8  0  6  1 28 29  1  4  1


The function seq(i) produces the sequence 1, 2, ..., i.
Negative indices in the subset operator [] delete observations.
Thus the expression table(parp.alt[-seq(i)])
produces the theoretical frequency analysis for the protein
with cleavage site “i”. The entire set of theoretical
frequencies can be determined in two lines:

> parp.cec <- table(parp.alt)
> for (i in seq(parp))
+   parp.cec <-
      cbind(parp.cec, table(parp.alt[-seq(i)]))
The first ten columns of the resulting matrix are:

> parp.cec[,seq(10)]
   parp.cec
 A  12 12 11 10 10 10  9  8  7  6
DN   6  6  6  6  6  6  6  6  6  6
 F   2  1  1  1  1  1  1  1  1  1
 G   9  9  9  9  9  9  9  9  9  9
 I   1  1  1  1  1  1  1  1  1  1
 K   7  7  7  7  7  7  7  7  7  7
 L   4  4  4  4  4  3  3  3  3  3
 P  33 33 33 33 33 33 33 33 33 33
QE  32 32 32 32 32 32 32 32 32 32
 S   2  2  2  2  2  2  2  2  2  2
 T   5  5  5  5  5  5  5  5  5  5
 V   3  3  3  3  2  2  2  2  2  2
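For readers without access to S, the same table of theoretical frequencies can be sketched in Python. This is an illustrative translation, not the authors' code: the sequence is the merged-code listing from parp.alt written amino-terminal first, and the function and variable names are hypothetical.

```python
from collections import Counter

LEVELS = ["A", "DN", "F", "G", "I", "K", "L", "P", "QE", "S", "T", "V"]

# Merged-code sequence as listed in parp.alt, amino-terminal first.
head = ("A QE G P QE DN K G L T K G G K G K G QE K G T K "
        "V S A DN DN T DN G T DN P DN").split()
repeat = ["P", "QE"] * 29 + ["P"]     # the long P/QE repeat region
tail = "G A A T L K S V A L P F A I A A A A L V A A F".split()
sequence = head + repeat + tail
assert len(sequence) == 116


def cleavage_counts(seq, levels):
    """Column i holds the amino acid counts after removing i residues
    from the carboxyl-terminal end (column 0 is the full polypeptide)."""
    c_terminal_first = seq[::-1]
    columns = []
    for removed in range(len(seq) + 1):
        tally = Counter(c_terminal_first[removed:])
        columns.append([tally.get(level, 0) for level in levels])
    return columns


columns = cleavage_counts(sequence, LEVELS)
print(columns[0])    # full polypeptide
print(columns[30])   # after removing 30 residues
```

Recounting from scratch for each truncation mirrors the S loop and is fast enough at this length; an incremental update that decrements one count per step would do the same work in a single pass.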
A function to evaluate the Dirichlet density is:

> ddirich
function(p, n, ...)
{
    # Dirichlet density
    if(length(n) != length(p))
        stop("length(n) must be length(p)")
    if(sum(p, ...) > 1 + std.tolerance()) {
        warning("sum of p > 1")
        0
    }
    else prod(p^n, ...)/gbeta(n+1, ...)
}

The Dirichlet part of the likelihood can be computed for
a given $\omega$ as

> parp.proportions <-
    parp.analysis/sum(parp.analysis)
> ddirich(parp.proportions, parp.cec[,21])
[1] 2.803521e+15

For the entire protein:

> parp.dlik <-
    apply(parp.cec, 2, ddirich, p=parp.proportions)

The gamma part of the likelihood is computed similarly:

> args(dgamma)
function(x, shape = stop("no shape arg"))
> dgamma(sum(parp.analysis), seq(parp))
[1] 1.650072e-39 1.473514e-37 6.579239e-36
> parp.glik <-
    rev(dgamma(sum(parp.analysis), seq(parp)))

To utilize the prior from Udenfriend and co-workers at Roche Labs, several functions taking the complete, known sequence as an argument were written. Four increasingly complex versions were implemented. roche1 assigns equal prior probability to all acids in the set (A, N, D, C, G, L, S, V). roche2 assigns prior probability in proportion to the second column in Table 2 (the $\omega$ site only). roche3 takes into account the $\omega$ and $\omega + 2$ sites, but ignores the $\omega + 1$ site. roche4 implements the full scheme using all three sites. In the last three functions, amino acids listed as "not done" (ND in Table 2) were assigned a value of 0.1, and zeros were set to 0.01.
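Combining a likelihood vector with one of these priors amounts to multiplying locus by locus and normalizing. The following Python sketch illustrates that step under stated assumptions: the function name is hypothetical, the prior weights need not sum to one, and the likelihood is passed on the log scale because values of order 1e+15 or 1e-39 are easier to handle that way.

```python
import math


def posterior_over_loci(log_likelihood, prior):
    """Posterior probabilities proportional to likelihood * prior at each candidate locus."""
    if len(log_likelihood) != len(prior):
        raise ValueError("log_likelihood and prior must have the same length")
    log_terms = [
        ll + math.log(pr) if pr > 0.0 else float("-inf")
        for ll, pr in zip(log_likelihood, prior)
    ]
    peak = max(log_terms)
    if peak == float("-inf"):
        raise ValueError("every locus has zero posterior weight")
    # Subtract the largest term before exponentiating to avoid overflow.
    weights = [math.exp(t - peak) for t in log_terms]
    total = sum(weights)
    return [w / total for w in weights]
```

A flat prior (all weights equal) reproduces the likelihood-only analysis.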

A convenient summary of this analysis is graphical, and we have written a specialized plotting function called plot.summary.

4.  EXAMPLES 

We present an example with known sequence and cleavage point: parp. Inspection of the likelihood as well as biological considerations restricts the range of interest to no more than the 60 amino acids closest to the carboxyl-terminal. Both the likelihood alone and the posterior using any of the four priors correctly identify the locus of cleavage.

Figure 1: Posterior for parp using the likelihood only.

Figure 1 shows the likelihood analysis for parp. The x-axis is the number of amino acids removed by computerized exoproteolytic cleavage, and the y-axis is the posterior probability under a noninformative prior (equal prior probability on all loci). The dots indicate the likelihood based on the proportions from the amino acid frequency analysis, while the solid line indicates the full likelihood. At the top of the graph, symbols for each amino acid in the nascent polypeptide are shown (the carboxyl-terminal amino acid end is to the left). The likelihood has a maximum at the 22nd amino acid, glycine, which is the actual cleavage site for parp.

Figures 2, 3 and 4 represent the impact of adding informative prior information to the likelihood analysis, using roche1, roche2, and roche4. In all three, the posterior concentrates more probability on the correct location, but the posterior (dashed line) based on the full specification suggested by Kodukula et al. (roche4) puts 90% probability on this location.

5.  DISCUSSION 

The Bayesian approach we propose has a number of advantages over existing methods. It allows both prior information and the amino acid frequency data to contribute to the inference. Replicate frequency data, which may exist though it is not published, can be incorporated directly. By producing a posterior distribution, probabilistic statements concerning most likely cleavage loci can be made: this is important in distinguishing cases where the prediction is highly certain.

Figure 2: Posterior for parp using the roche1 prior.

Figure 3: Posterior for parp using the roche2 prior.

Figure 4: Posterior for parp using the roche4 prior.

The flexibility and presentation quality available from S make it an ideal computing environment for this problem. The functions written for the Bayesian analysis can be immediately applied to other proteins once the data are input appropriately. Alternative specifications for the likelihood or prior are easily programmed, as are customizations of the graphical summary. Since there appears to be some sensitivity to the prior, rapid recomputation and redisplay are particularly useful.

Abstract

The magnitude of the effect of deleterious genes on a population is classically characterized by the number of lethal equivalents. In conservation and breeding programs, it is often important to be able to distinguish among different combinations of the genetic parameters that lead to the same number of lethal equivalents, for instance, a large number of mildly deleterious genes or a few fully lethal ones. This requires at least two consecutive generations of mating. Because of the complexity of the likelihood and the large amount of missing data in this two generation case, Bayesian Markov chain simulation is used to infer these parameters and the missing data. In our Markov chain Monte Carlo approach, we introduce a Metropolis-Hastings algorithm for a two dimensional update of parameters having a highly attenuated posterior density.

Key Words: Deleterious genes, Lethal equivalents, Bayesian Markov chain simulation, Metropolis-Hastings algorithm.

1  Introduction 

The overall effect of deleterious genes on a population is classically characterized by the number of lethal equivalents (Morton, Crow, and Muller, 1956). In a diploid population where deleterious alleles exist at $M$ loci with allele frequency $q_i$ and selection coefficient $s_i$ at the $i$-th locus, the number of lethal equivalents at the gametic level is expressed as (Morton et al. 1956),

$$e = \sum_{i=1}^{M} s_i q_i. \qquad (1)$$

At the zygotic level the number of lethal equivalents is $E = 2e$. The selection coefficient $s_i$ is defined from the convention that the probabilities of survival of the dominant homozygote, heterozygote, and recessive homozygote are $1$, $1 - h_i s_i$, and $1 - s_i$, respectively, and $h_i$ represents the coefficient of dominance at the $i$-th locus. Lee, Nordheim, and Kang (1994) suggested a new experimental and statistical modeling strategy that leads to an interval estimate of the number of lethal equivalents based on a one generation mating design. In conservation biology and breeding programs, it is often important to have more information on the parameters involved in the genetic mortality, i.e., $M$, $q_i$, $s_i$, and $h_i$, beyond the estimation of the overall mortality effect on a population. To attack this, we introduce a two generation mating model, and construct a corresponding hierarchical likelihood. The complexity of the likelihood leads us to apply Bayesian Markov chain Monte Carlo (MCMC) to enable inference. Data from a two generation mating system ensure identifiability of the selection coefficient. For one generation data, this coefficient is confounded with the number of lethal equivalents (Lee et al., 1994). In our MCMC implementation, a new two dimensional proposal distribution is used for parameters having an attenuated joint posterior distribution.

* North Central Forest Experiment Station and Department of Forestry, University of Wisconsin - Madison

2 Hierarchical Modeling of Two Generation Mating

We consider selfing as the mating system of the experiment here because of its simplicity (see Lee et al., 1994). We assume that the dominance coefficient ($h_i$) of lethal (= deleterious) alleles is zero, that is, the deleterious alleles are recessive, so that a heterozygous individual with these alleles is always viable. We also assume a single lethal allele frequency $q$ per gamete over all loci carrying lethals ($M$ loci), and a single selection coefficient $s$ for these lethal alleles. The selfing experiment has several hierarchical steps. First, we assume that our base (monoecious diploid) population has $M$ loci with deleterious allele frequency $q$ per locus. From this population we randomly sample $N$ parents. This individual sampling can be expressed by two random variables, the number of heterozygous ($v_{1i}$) and homozygous ($w_{1i}$) loci of the $i$-th parent for the lethal alleles, $i = 1, \ldots, N$. Next, we self each of these parents to obtain the selfed offspring of the first generation, $n_{1i}$, $i = 1, \ldots, N$. Next, we check the number of unviable offspring, $d_{1i}$, for each family (parent). The genetic mortality of each offspring from the $i$-th parent can be represented as a random variable with the given parameter values $v_{1i}, w_{1i}$, which are the realizations of the first sampling distribution. The second generation selfing is the same as in the first generation except that we randomly choose one viable offspring from each family line, and proceed to a second generation of selfing. The sampling and mortality distributions in the second generation can be defined in terms of $n_{2i}$, $d_{2i}$, $v_{2i}$, and $w_{2i}$ as in the first generation. Family lines can be lost when no viable offspring is obtained in the first generation (Figure 1).

Figure 1: Hierarchical structure of two generation mating experiment and missing data

If we assume that all the loci with lethal alleles act independently, we can define the sampling distribution of parents ($P_{1i}$) by a multinomial distribution with parameters $M$, the total number of lethal loci, and $p_1$ and $p_2$, the relative frequencies of homozygous and heterozygous genotypes, respectively. The sampling distribution of the second generation ($P_{3i}$), conditional upon the parent lethal counts, is a function of the lethal counts $v_{1i}, w_{1i}$ and the selection coefficient $s$, but not of the parameters $M$ and $q$: if a parent having $(v_{1i}, w_{1i})$ lethal loci is chosen, no further information about mortality of the offspring is gained by knowing $M$ and $q$; the relative frequencies of homozygous and heterozygous individuals after viability selection are a function of both the numbers of deleterious loci of the parent and the selection coefficient (Falconer, 1981). $P_{3i}$ can also be derived as another multinomial distribution with parameters $v_{1i}$ and $p_3$. Note that homozygous loci will be transmitted as homozygous to offspring by selfing. If we assume that the viability of each progeny from one parent is conditionally independent given the parent genotype, $(v_{1i}, w_{1i})$ at the first generation (or $(v_{2i}, w_{2i})$ at the second generation), the mortality distributions of the two generations, $P_{2i}$ and $P_{4i}$, are binomial with parameters $n_{ki}$ and $Q_{ki} = 1 - (1 - s/4)^{v_{ki}} (1 - s)^{w_{ki}}$, $k = 1, 2$, $i = 1, \ldots, N$, respectively. So, the complete likelihood including missing data can be derived as the product of these four distributions over all $N$ families:

$$L_c(M, q, s : \{v_{1i}, w_{1i}, v_{2i}, w_{2i}\}) = \prod_{i=1}^{N} P_{1i} P_{2i} (P_{3i} P_{4i})^{U_i}, \qquad (2)$$

where $U_i$ is the indicator of whether family $i$ is extinct or not at the first generation ($U_i = 1$ if the family survives, so that its second generation terms enter the product).
However, because we do not observe the lethal loci counts $v_{ki}, w_{ki}$, $k = 1, 2$, $i = 1, \ldots, N$, equation (2) is not the actual likelihood of the parameters. Importantly, the actual likelihood $L(M, q, s)$ is a summation over all possible configurations of missing data in equation (2).

3 Markov Chain Monte Carlo

3.1 Strategy

The Markov chain Monte Carlo (MCMC) method (Gelfand and Smith 1990, Smith and Roberts 1993, Besag and Green 1993) is used to overcome the fact that the actual likelihood is analytically intractable. Our strategy is to apply Bayesian analysis by formulating a prior distribution $\pi_0(M, q, s)$ over the parameter space and then using MCMC to simulate the joint posterior distribution of parameters and missing data. This distribution has density $\pi$,

$$\pi \propto L_c(M, q, s : \{v_{1i}, w_{1i}, v_{2i}, w_{2i}\})\, \pi_0(M, q, s),$$

and can be readily evaluated (up to a constant). We use marginal posterior distributions as the basis for inference.

So, our complete likelihood $L_c$ is defined over a $3 + 4N$ dimensional space of parameters and missing data, $(M, q, s, v_1, w_1, v_2, w_2)$. We first consider independent uniform priors on the three parameters: independent uniform (0,1) priors are used for the lethal allele frequency and the selection coefficient, and a discrete uniform prior between 100 and 10,000 is used for $M$, because the total number of lethal loci of natural species is known to be bounded by a certain number.

The MCMC algorithm has 3 component chains, each modifying different aspects of the state $x = (M, q, s, v_{1i}, w_{1i}, v_{2i}, w_{2i})$. Each component chain is defined by a proposal distribution

$$q(x, x^*) \qquad (3)$$

that says how candidate states are generated given the current state. A move to $x^*$ occurs with probability that is the minimum of 1 and the Metropolis-Hastings (MH) ratio

$$\frac{\pi(x^*)\, q(x^*, x)}{\pi(x)\, q(x, x^*)}. \qquad (4)$$

These 3 components allow us to update each state value in the situation that a direct update is intractable.
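The accept/reject rule built on the MH ratio can be sketched generically in Python. This is an illustrative sketch, not the authors' implementation: all function names are hypothetical, densities are handled on the log scale, and the small demonstration chain targets a standard normal density rather than the posterior of this paper.

```python
import math
import random


def mh_accept_probability(log_pi_x, log_pi_new, log_q_forward, log_q_reverse):
    """min(1, [pi(x*) q(x*, x)] / [pi(x) q(x, x*)]), computed on the log scale."""
    log_ratio = (log_pi_new + log_q_reverse) - (log_pi_x + log_q_forward)
    return 1.0 if log_ratio >= 0.0 else math.exp(log_ratio)


def mh_step(x, log_pi, propose, log_q, rng):
    """One update: draw a candidate, then move to it with the MH acceptance probability.

    propose(x, rng) draws x*; log_q(a, b) is the log proposal density of b given a.
    """
    x_new = propose(x, rng)
    prob = mh_accept_probability(
        log_pi(x), log_pi(x_new), log_q(x, x_new), log_q(x_new, x)
    )
    return x_new if rng.random() < prob else x


def demo_chain(n_steps, seed=0):
    """Random-walk chain for a standard normal target (illustration only)."""
    rng = random.Random(seed)
    log_pi = lambda x: -0.5 * x * x                      # unnormalized log density
    propose = lambda x, r: x + r.uniform(-1.0, 1.0)      # symmetric uniform step
    log_q = lambda a, b: 0.0                             # symmetric: constants cancel
    x, draws = 0.0, []
    for _ in range(n_steps):
        x = mh_step(x, log_pi, propose, log_q, rng)
        draws.append(x)
    return draws
```

Because only ratios enter the rule, the target density needs to be known only up to a constant, which is what makes the joint posterior above usable.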


3.2  Missing  Data  Update 

A Metropolis-Hastings (MH) algorithm is used for updating each of the missing data values, $v_{1i}$, $w_{1i}$, $v_{2i}$, and $w_{2i}$. We use the sampling distributions, $P_{1i}$ and $P_{3i}$, in the hierarchical model as the proposal distributions of the MH algorithm. For instance, the proposal distribution of $v_{1i}$ can be derived as a binomial distribution with parameters $M - w_{1i}$ and $p_2/(1 - p_1)$, where $p_1$ and $p_2$ are defined as in the previous section. Then, we sample a candidate $v_{1i}^*$ from the proposal distribution, and calculate the MH ratio in equation (4) to decide whether we move to the new state value $v_{1i}^*$. The other missing data can be updated in a similar manner.

4  Parameter  Updates 

Choice of proposal distribution in the MH algorithm can dramatically affect efficiency. For example, proposals which tend to be close to the current state yield high acceptance rates but may lead to slowly mixing chains. Conversely, overly dispersed proposals will move far but have low acceptance rates. A good proposal distribution balances these opposing constraints.

For updating $s$, we use a random walk proposal distribution, which chooses a candidate uniformly from a neighborhood of the current state value of $s$. Specifically, we choose a candidate $s^*$ uniformly from a small interval of length $2\delta$ around the current state value $s$, and then decide whether to move to $s^*$ or not by calculating the MH ratio of $s^*$ and $s$ in equation (4). The width of the neighborhood can be determined by considering the acceptance ratio of the proposal chain: the wider the neighborhood, the lower the acceptance ratio. However, if it is too narrow, the chain also mixes slowly because a candidate cannot move far from the current value. So, we need to compromise. This proposal chain can be constructed to be symmetric by adopting an interval of width $2\delta$ from an edge if the current value of $s$ is close to the boundary. Thus, the transition probabilities of the proposal distribution in equation (4) cancel out, so that the MH ratio reduces to the likelihood ratio of the candidate $s^*$ and the current state value $s$.
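The window construction for $s$ can be sketched as follows. This is a hypothetical Python illustration of the proposal only (the function names and the support bounds as arguments are assumptions); the candidate it returns would then be accepted or rejected with the ratio described in the text.

```python
import random


def proposal_window(s, delta, lower=0.0, upper=1.0):
    """Interval of width 2*delta around s, taken from the edge when s is near a boundary."""
    if not 0.0 < 2.0 * delta <= upper - lower:
        raise ValueError("window must fit inside the support")
    lo, hi = s - delta, s + delta
    if lo < lower:
        lo, hi = lower, lower + 2.0 * delta      # shift right, keep the width
    elif hi > upper:
        lo, hi = upper - 2.0 * delta, upper      # shift left, keep the width
    return lo, hi


def propose_s(s, delta, rng, lower=0.0, upper=1.0):
    """Draw a candidate s* uniformly from the window around the current s."""
    lo, hi = proposal_window(s, delta, lower, upper)
    return rng.uniform(lo, hi)
```

Tuning `delta` is the compromise the text describes: a larger value lowers the acceptance rate, a smaller one slows the movement of the chain.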

In a preliminary study, $M$ and $q$ appeared to be highly correlated in their posterior; they are distributed in an attenuated region along a reciprocal line (Figure 2). Note that the product of these two parameters should stay constant for a fixed number of lethal equivalents because, basically, the product represents the average number of lethal alleles carried by an individual from the population.


Figure 2: Joint posterior of (M, q). Horizontal axis: total lethal loci (M).


Such high correlation is well known to cause problems for any single-site or componentwise updating scheme. With two single-site proposal distributions, our MCMC implementation mixed extremely slowly (data not shown). To overcome this problem, we use a two-dimensional proposal distribution based on the knowledge that lines of constant $Mq$ will have approximately constant posterior density. This two-dimensional proposal chain is not symmetric, and the ratio of the proposal transition probabilities is now

$$\frac{q((M^*, q^*), (M, q))}{q((M, q), (M^*, q^*))} = \frac{1/M^*}{1/M} = \frac{M}{M^*}.$$


Two-dimensional MH proposal distribution

1. Sample $M^*$ from a discrete uniform distribution in a bounded interval.

2. For some constant $\delta$, sample $q^*$ uniformly from the interval

   $$\frac{Mq - \delta}{M^*} < q^* < \frac{Mq + \delta}{M^*},$$

   where $M$, $q$ are current values.

3. Calculate the two-dimensional MH ratio

   $$r = \frac{\pi(M^*, q^* \mid \text{rest})}{\pi(M, q \mid \text{rest})} \cdot \frac{M}{M^*}.$$

4. Move to $(M^*, q^*)$ with probability $\min(r, 1)$.

Note that, in step 2, if we are close to a boundary of the support of $q$, the interval with length $2\delta/M^*$ is chosen from the edge, and not centered at the current state value $q$.

Intuitively, our update method does not reparametrize the parameters $M$, $q$ in the model, but constructs a proposal distribution that can move along a reparametrized space of them.
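The four steps of the two-dimensional update can be sketched in Python as below. This is a minimal sketch under stated assumptions, not the authors' code: the function names, the choice of $\delta$, and the `log_post` callable (the log of the joint posterior density given the rest of the state) are all hypothetical, and the boundary shift follows the note on step 2.

```python
import math
import random


def propose_m_q(m, q, delta, m_low, m_high, rng):
    """Steps 1-2: draw M* uniformly, then q* so that M* q* stays within delta of M q."""
    m_new = rng.randint(m_low, m_high)           # discrete uniform on [m_low, m_high]
    width = 2.0 * delta / m_new                  # interval length in q
    lo = (m * q - delta) / m_new
    hi = lo + width
    if lo < 0.0:                                 # near the lower edge of the support of q
        lo, hi = 0.0, width
    elif hi > 1.0:                               # near the upper edge
        lo, hi = 1.0 - width, 1.0
    return m_new, rng.uniform(lo, hi)


def mh_ratio_m_q(log_post, m, q, m_new, q_new):
    """Step 3: r = pi(M*, q*) / pi(M, q) * M / M*."""
    return math.exp(log_post(m_new, q_new) - log_post(m, q)) * m / m_new


def update_m_q(m, q, delta, m_low, m_high, log_post, rng):
    """Step 4: move to (M*, q*) with probability min(r, 1)."""
    m_new, q_new = propose_m_q(m, q, delta, m_low, m_high, rng)
    r = mh_ratio_m_q(log_post, m, q, m_new, q_new)
    return (m_new, q_new) if rng.random() < min(r, 1.0) else (m, q)
```

The factor `m / m_new` is the proposal correction: the interval for $q^*$ shrinks as $M^*$ grows, so the proposal density is proportional to $M^*$.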

5  Estimation  and  Marginal 
density 

We apply MCMC to a data set simulated under parameter values $M = 3000$, $q = .002$, and $s = .45$ (so $E = 2Mqs = 5.4$). Using a burn-in of 1,500 iterations and then keeping every tenth draw, we obtained 500 samples of $(M, q, s, v_1, w_1, v_2, w_2)$ from the MCMC run. The posterior means of $s$ and $E$ were close to the simulated parameter values (Table 1).
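The arithmetic behind $E = 2Mqs$ follows from equation (1) when every locus shares the same $s$ and $q$. The short Python sketch below (hypothetical function name) evaluates the general sum; the second parameter set is an invented illustration of how a different combination can give the same number of lethal equivalents.

```python
def lethal_equivalents(selection, frequency):
    """Return (e, E): gametic e = sum(s_i * q_i) and zygotic E = 2 * e."""
    if len(selection) != len(frequency):
        raise ValueError("need one selection coefficient per allele frequency")
    e = sum(s * q for s, q in zip(selection, frequency))
    return e, 2.0 * e


# Simulated setting from the text: M = 3000 loci, q = .002, s = .45.
e_sim, E_sim = lethal_equivalents([0.45] * 3000, [0.002] * 3000)

# Hypothetical alternative: 27 fully lethal loci (s = 1) at frequency 0.1.
e_alt, E_alt = lethal_equivalents([1.0] * 27, [0.1] * 27)
```

Both settings give the same $E$, which is why one generation data alone cannot separate them.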

Table 1. Estimates of posterior samples

                     E(.|data)    std. dev.(.|data)
sel. coef. (s)       .471         .059
lethal equiv. (E)    5.55         .35
lethal loci (M)      6592.1       2328.4
allele freq. (q)     1.12e-3      .76e-3

Our two generation mating data do not give separate information about the parameters $M$ and $q$. As shown in Figure 2, their posterior samples were attenuated, and the posterior mean of each has a large sampling variation.

In this problem, we have two different sources of error: how close the true posterior means are to the parameter values (3000, .002, .45), and how close the Monte Carlo estimates are to the true posterior means. The error of the MCMC estimates is different from the distance between a posterior mean and a parameter value. So, some caution is needed in interpreting the estimates.

Figure 3 shows the (posterior) marginal density of $s$ obtained by Rao-Blackwellization (Gelfand and Smith, 1990). It is highly concentrated around the posterior mean of $s$.


Figure  3:  Marginal  posterior  density  of  selection 
coefficient  using  Rao-Blackwellization 


6  Discussion 

Our two generation data allow us to obtain information about the selection coefficient and to estimate the number of lethal equivalents as well. In our MCMC implementation, the two-dimensional proposal distribution drastically improves the efficiency of the algorithm, and overcomes a typical difficulty of single-site or componentwise updating schemes for highly correlated parameters.

Our data and likelihood model do not give information about the parameters $M$ and $q$ beyond a highly attenuated joint distribution of them. This seems to be directly related to an identifiability problem of the likelihood for these parameters. We also need to relax many restrictions, e.g., the single allele frequency, single selection coefficient, and recessive lethal allele assumptions, which requires investigating both different statistical modeling and different experimental strategies. The difficulty may be overcome by using a combination of different mating systems.

RELATIVE AGGREGATION AND RANDOM QUADRAT SAMPLING

ABSTRACT.

A distributional parameter, the relative aggregation coefficient (RAC), is derived from a probabilistic model of random quadrat sampling. New insight into random quadrat sampling is obtained from the derivation. A firm theoretical ground for the interpretation and the uses of RAC in several statistical applications is established.

1.  INTRODUCTION 

Random quadrat sampling is a tool often used in statistical ecology (Pielou 1977, Lloyd 1967, Ripley 1981, and Cressie 1991) to study the spatial distribution of individuals. A quadrat is a small neighborhood used as a sampling unit. The numbers of observed individuals in each quadrat, the quadrat counts, are the primary data for inference. A probabilistic model for random quadrat sampling is introduced here. The observed individuals are regarded as i.i.d. realizations from a population distribution $P_f$ having a density $f$, and the positions of the quadrats are regarded as i.i.d. observations from a design distribution $P_g$ having a density $g$.

Let $\mu$ be the mean of the quadrat counts, and let $\Lambda$ be the variance-to-mean ratio of the quadrat counts. It is shown that under a proper condition on the population density $f$, $\Lambda$ is approximately a linear function of $\mu$: $\Lambda \approx 1 + [\alpha(f|g) - 1]\mu$, where the slope is determined by the relative aggregation coefficient (RAC) of the population density $f$ with respect to the design density $g$, $\alpha(f|g) := \int f^2 g / (\int f g)^2$. The derivation reveals new insights into, and uses of, random quadrat sampling.

The population density $f$ can be explored using the RAC $\alpha(f|g)$ with different choices of $g$. It is shown that $f$ is the uniform density on a finite region if and only if the RAC $\alpha(f|g) = 1$ for any density $g$ ($f$ itself in particular) totally concentrating on the same region. The equivalence of two densities is characterized by equalities among six particular RACs. These characterizations provide a firm theoretical ground for RAC-based inferences by random quadrat sampling.

2. QUADRAT SAMPLING IN PROXIMITY SPACES

A proximity space is a pair $(D, d)$, where $D$ is a point set, and $d$ is a non-negative real function on $D \times D$ such that $d(x, y) = d(y, x)$, $x, y \in D$. The space $\mathbb{R}^k$ with any $\ell^p$ metric ($0 < p < \infty$) forms a proximity space $(\mathbb{R}^k, \ell^p)$. For an example of random quadrat sampling in the Euclidean space $\mathbb{R}^2$, see Cressie (1991) pp. 588-591. To accommodate applications with non-Euclidean data (see Johnson 1989, Johnson and Maggiora 1990 for examples), the concepts are developed for general proximity spaces.

For convenience of discussion, the proximity function $d$ is regarded as a distance (the larger $d(x, y)$, the further apart are $x$ and $y$). For any $x \in D$, a "quadrat" at $x$ is simply a neighborhood $B_r(x) = \{y \in D : d(x, y) < r\}$, $r > 0$. Note $B_r(x)$ monotonically decreases as $r \downarrow 0$. Let $(D, \mathcal{D}, \nu)$ be a measure space, where $\mathcal{D}$ is a sigma algebra rich enough so that all the open neighborhoods are $\mathcal{D}$-measurable, and $\nu$ is a complete and sigma-finite measure.

Remark 2.1. Given a proximity measure $d$, a topology $\mathcal{T}$ can be generated using the open neighborhoods $B_r(x)$, $x \in D$, $r > 0$, as a basis. Then $\mathcal{D}$ is the smallest sigma algebra generated by the open sets of $\mathcal{T}$.

In the probabilistic model of random quadrat sampling, the data points (e.g., positions of plants in a field under study) are regarded as a random sample $X_1, X_2, \ldots, X_N$ from a population distribution $P_f$ on $(D, \mathcal{D})$, $P_f \ll \nu$, with the population density $f = dP_f/d\nu$. The positions of the quadrats are regarded as a random sample from a design distribution $P_g \ll \nu$ with the design density $g = dP_g/d\nu$. Let $Z \sim P_g$. Define the quadrat count for $r > 0$ at $Z$ to be the random variable

$M_r$ = [number of $X_i$'s in the quadrat $B_r(Z)$].

Define the mean quadrat count $\mu_r := E(M_r)$, and the quadrat count variance-to-mean ratio $\Lambda_r := \mathrm{Var}(M_r)/\mu_r$. An approximate linear relationship between $\Lambda_r$ and $\mu_r$ can be derived under the condition that as $r \to 0$, a.e. $\nu$,

$$\int_{B_r(z)} f(t)\, d\nu(t) = h_f(z)\, \psi_f(r) + o(\psi_f(r)). \qquad (2.1)$$

The functions $h_f$ and $\psi_f$ are assumed to depend on the population density $f$ and the proximity measure $d$ (cf. Examples 2.1-2.3 below). For details see Cheng and Johnson (1994a).
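The quadrat count and the variance-to-mean ratio defined above have direct empirical counterparts. The Python sketch below is an illustration only: the function names are hypothetical, the proximity function is passed in as an argument, and the use of the population variance (dividing by the number of quadrats) is an assumption rather than something the text specifies.

```python
def quadrat_counts(points, centers, r, dist):
    """Number of points in each quadrat B_r(z) = {y : dist(z, y) < r}."""
    return [sum(1 for x in points if dist(z, x) < r) for z in centers]


def variance_to_mean(counts):
    """Return (mean, variance / mean) of the quadrat counts."""
    k = len(counts)
    mean = sum(counts) / k
    if mean == 0:
        raise ValueError("all quadrats are empty; the ratio is undefined")
    variance = sum((c - mean) ** 2 for c in counts) / k
    return mean, variance / mean


# Tiny one-dimensional example with the absolute difference as proximity.
counts = quadrat_counts([0.0, 1.0, 2.0, 10.0], [0.5, 10.0], 1.0, lambda a, b: abs(a - b))
mean, ratio = variance_to_mean(counts)
```

Any symmetric non-negative function can be supplied as `dist`, which matches the generality of the proximity space setting.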

Theorem 2.1. Under condition (2.1), as $r \to 0$,

$$\Lambda_r = 1 + \left[\frac{N-1}{N} A(f|g) - 1\right] \mu_r + o(\mu_r), \qquad (2.2)$$

where

$$A(f|g) = \frac{\int_D h_f^2(t)\, g(t)\, d\nu(t)}{\left[\int_D h_f(t)\, g(t)\, d\nu(t)\right]^2}. \qquad (2.3)$$

$A(f|g)$ will be referred to as the quasi relative aggregation coefficient (quasi RAC) of $f$ with respect to $g$. If $h_f$ coincides with $f$ (see e.g. Example 2.1), the quasi RAC becomes the relative aggregation coefficient (RAC)

$$\alpha(f|g) = \frac{\int_D f^2 g\, d\nu}{\left[\int_D f g\, d\nu\right]^2}. \qquad (2.4)$$

The special case $g = f$ gives a new distributional characteristic $\alpha(f) := \alpha(f|f) = \int f^3 / (\int f^2)^2$; call it the self-aggregation coefficient of $f$.

Condition (2.1) demonstrates the interplay between the proximity measure $d$ and the population density $f$ that leads to the approximate linear relationship (2.2). It is instructive to consider the following examples.

Example 2.1. In the Euclidean space $(\mathbb{R}^k, \ell^2)$, if $f$ is at least twice continuously differentiable with bounded mixed derivatives, then Taylor expansion establishes $\int_{B_r(z)} f(t)\, dt = f(z) v_r + o(v_r)$, $r \to 0$, where $v_r$ is the volume of $B_r(z)$. Note when $d$ is the Euclidean distance, $B_r(z)$ is a hyperball of radius $r$ centered at $z$. In fact condition (2.1) is satisfied for any $\ell^p$ distance ($p = 1, 2, \ldots, \infty$), if $f$ is at least twice differentiable. In this case the function $h_f = f$ in (2.1).

Example 2.2. Consider the truncated bivariate normal density

$$f(x_1, x_2) = 4(2\pi)^{-1} \exp\{-(x_1^2 + x_2^2)/2\}, \quad x_1, x_2 > 0,$$

on $(\mathbb{R}^2, \ell^2)$. $f$ is infinitely continuously differentiable with bounded mixed derivatives except on an edge ($x_1 = 0$ or $x_2 = 0$) with zero measure. Set the design density $g = f$. Then by Example 2.1, $A(f|g) = A(f|f) = \int f^3 / (\int f^2)^2 = 4/3$.

Example 2.3. Let $f$ and $g$ be the same as in Example 2.2, but now let $d$ be the proximity $d(x, y) = 1 - |x'y / (\|x\| \cdot \|y\|)|$, where the vectors $x, y \in \mathbb{R}^2$, and $\|x\|$ denotes the length of $x$. Note $d(x, y)$ is simply one minus the absolute correlation coefficient between $x$ and $y$. It is convenient to use polar coordinates in this case. For a point $z \in \mathbb{R}^2$ with polar coordinates $(\rho_z, \theta_z)$, and $0 < r < 1$, the quadrat $B_r(z) = \{x : d(x, z) < r\}$ can be represented in polar coordinates as $B_\delta(z) = \{(\rho, \theta) : \rho > 0, |\theta - \theta_z| < \delta\}$, with $\delta = \arccos(1 - r)$. Note $B_\delta(z)$ is a cone. In polar coordinates the truncated normal density $f$ reads $f(\rho, \theta) = 4(2\pi)^{-1} \rho \exp\{-\rho^2/2\}$, $\rho > 0$, $0 < \theta < \pi/2$. Thus

$$\int_{B_\delta(z)} f(\rho, \theta)\, d\rho\, d\theta = \iint_{B_\delta(z)} 4(2\pi)^{-1} \rho \exp\{-\rho^2/2\}\, d\rho\, d\theta = 2\pi^{-1}\delta.$$

So condition (2.1) holds with $h_f(z) = 1$ and $\psi_f(\delta) = 2\delta/\pi$, giving $A(f|g) = \int h_f^2 f / (\int h_f f)^2 = 1$.

These latter two examples demonstrate that the formation of the quadrats (determined by the proximity $d$) is essential to what may be observed from the quadrat counts.
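The value 4/3 in Example 2.2 can be checked numerically. The Python sketch below evaluates the self-aggregation coefficient $\int f^3 / (\int f^2)^2$ with a plain midpoint rule on a square; it is a hypothetical illustration (function names, truncation of the integration range at 8, and grid size are all assumptions), not part of the original derivation.

```python
import math


def self_aggregation_2d(f, upper, n):
    """Midpoint-rule estimate of int f^3 / (int f^2)^2 over the square [0, upper]^2."""
    h = upper / n
    cell = h * h
    int_f2 = 0.0
    int_f3 = 0.0
    for i in range(n):
        x1 = (i + 0.5) * h
        for j in range(n):
            x2 = (j + 0.5) * h
            v = f(x1, x2)
            int_f2 += v * v * cell
            int_f3 += v * v * v * cell
    return int_f3 / int_f2 ** 2


def truncated_normal(x1, x2):
    """Density of Example 2.2 on the positive quadrant."""
    return 4.0 / (2.0 * math.pi) * math.exp(-(x1 * x1 + x2 * x2) / 2.0)


def uniform_unit_square(x1, x2):
    """Uniform density on the unit square, for comparison."""
    return 1.0 if 0.0 <= x1 <= 1.0 and 0.0 <= x2 <= 1.0 else 0.0


alpha_normal = self_aggregation_2d(truncated_normal, 8.0, 300)
alpha_uniform = self_aggregation_2d(uniform_unit_square, 1.0, 50)
```

The uniform density attains the lower bound 1, while the truncated normal gives a value close to 4/3.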

The integral $\int_{B_r(z)} f\, d\nu$ represents the local population density in the quadrat $B_r(z)$. When (2.1) holds, the discrepancy between the population density $f$ and the function $h_f$ on the right-hand side represents the possible distortion introduced by quadrat counts in reflecting the true population density. When $h_f$ is constant, the quadrat counts reflect essentially uniformity in the population.

Theorem 2.2. Under condition (2.1), $h_f = c$ (a constant) a.e. $P_g$ if and only if the quasi RAC $A(f|g) = 1$.

Let $\xi$ be a random variable following distribution $P_g$. Then $A(f|g) = E[h_f^2(\xi)] / (E[h_f(\xi)])^2 \ge 1$. The uniformity reflected by quadrat sampling is characterized by $A(f|g)$ attaining the lower bound 1. Similarly, the uniformity in the population relative to a reference density $g$ is characterized by the RAC $\alpha(f|g) = 1$. Define the support of a density $f$ on $D$ to be the set $\mathrm{supp} f := \{x \in D : f(x) > 0\}$. A general characterization of uniformity is given by

Theorem 2.3. RAC characterization of uniformity. Let $f$ and $g$ be two densities on $D$ with respective supports $S_f$ and $S_g$. Assume $\nu(S_f \cap S_g) > 0$. Then $\nu(S_g) < \infty$ and $f = c$ (a positive constant) a.e. $\nu$ on $S_g$, if and only if $\alpha(f|g) = 1$.

Given a region $R \in \mathcal{D}$ with $0 < \nu(R) < \infty$, the uniform density can be defined as $u(x) = [\nu(R)]^{-1} I_R(x)$, $x \in D$, with $I$ the indicator function. The special case $g = f$ gives

Corollary 2.4. Let $f$ be a density with support $S_f$. Then $\nu(S_f) < \infty$ and $f$ is a uniform density if and only if $\alpha(f) = 1$.

Since $\alpha(f|g) \ge 1$, uniformity is characterized by certain RACs attaining the lower bound 1.

Theorem 2.3 reveals the interpretation of the RAC $\alpha(f|g)$: it measures the contrast between the dense and the sparse regions introduced by the population density $f$ relative to the design/reference density $g$. The uniform distribution over any finite region introduces no density contrast relative to any density totally concentrating on that region. For further discussions and examinations of Examples 2.2 and 2.3, see Cheng and Johnson (1994a).

Remark 2.2. Note that although the RAC $\alpha(f|g)$ is derived for the random quadrat sampling model, which involves a proximity measure $d$, the definition of RAC requires nothing but two probability densities. Theorem 2.1 demonstrates that if a proximity measure interacts properly with the population density $f$ in the random quadrat sampling, one can arrive at the RAC as a result, and any proximity measure satisfying (2.1) with $h_f = f$ will do. In general, if another proximity measure $d'$ generates a sub-sigma algebra $\mathcal{D}'$ of $\mathcal{D}$ (possibly $\mathcal{D}' = \mathcal{D}$) in the way described in Remark 2.1, then all the quadrats formed by $d'$ are $\mathcal{D}$-measurable. Quadrat sampling can be performed using $d'$ as well, but the results may or may not agree with those from the sampling performed with $d$, as shown in Examples 2.2 and 2.3. The use of quadrat sampling to estimate an RAC will be briefly discussed in Section 4.

3. FURTHER PROPERTIES OF RAC

Further properties of RAC are highlighted here. These properties establish a theoretical ground for the use and interpretation of RAC in statistical inferences using random quadrat sampling. Detailed discussions and elaborations appear in Cheng and Johnson (1994a, b).

Let $(D, \mathcal{D}, \nu)$ be a measure space with $\nu$ a complete and sigma-finite measure. A $\nu$-density function on $D$ is a non-negative real function $f$ satisfying $\int_D f\, d\nu = 1$.

The RAC $\alpha(f\mid g)$ possesses interesting invariance properties. For example, it is invariant under linear transforms.

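Theorem 3.1. Let $f$ and $g$ be two densities on $\mathbb{R}^k$ with $\int f g > 0$. Let $T$ be a $k \times k$ nonsingular matrix, and $J = \det(T^{-1})$. Fix $x_0 \in \mathbb{R}^k$. Let $\tilde f(x) = f(T^{-1}(x - x_0))\,|J|$, and $\tilde g(x) = g(T^{-1}(x - x_0))\,|J|$. Then $\alpha(\tilde f\mid \tilde g) = \alpha(f\mid g)$. In particular, $\alpha(\tilde f) = \alpha(f)$.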

Theorem 3.2. Independence. Let $f$ and $g$ be two densities on $\mathbb{R}^k$ with $\int f g > 0$. If $f$ and $g$ are such that $f = \prod_{i=1}^m f_i$, $g = \prod_{i=1}^m g_i$, where $f_i$ and $g_i$ are densities on $\mathbb{R}^{k_i}$ with $\int f_i g_i > 0$, then

$\alpha(f\mid g) = \prod_{i=1}^m \alpha(f_i\mid g_i)$.

In particular, $\alpha(f) = \prod_{i=1}^m \alpha(f_i)$.

Let $f_1$, $f_2$ be two densities on $D$, and let $f = (f_1 + f_2)/2$, i.e., the density of the even mixture.

Theorem 3.3. $\alpha(f_1\mid f) > \alpha(f\mid f_1)$ if and only if $\int f_1^3 > \int f_2^2 f_1$. $\alpha(f_2\mid f) > \alpha(f\mid f_2)$ if and only if $\int f_2^3 > \int f_1^2 f_2$.

By rewriting the inequality $\int f_1^3 > \int f_2^2 f_1$ as $\int f_1^2 f_1 > \int f_2^2 f_1$, it is seen that the inequality $\alpha(f_1\mid f) > \alpha(f\mid f_1)$ reflects the discrepancy in the concentration of the two densities. The following theorem demonstrates that the equivalence of two densities is characterized by equalities among certain RACs.

Theorem 3.4. $f_1 = f_2$ a.e. $\nu$ if and only if $\alpha(f_1) = \alpha(f_2)$ and $\alpha(f_1\mid f) = \alpha(f\mid f_1) = \alpha(f_2\mid f) = \alpha(f\mid f_2)$.

Theorem 3.2 shows the behavior of RAC under independence. As a consequence of Theorem 3.4, independence can be characterized by equalities among six particular RACs. For $i = 1, 2$, let $(D_i, \mathcal{D}_i, \nu_i)$ be a measure space with $\nu_i$ a complete and sigma-finite measure, let $P_i$ be a probability measure on $(D_i, \mathcal{D}_i)$, $P_i \ll \nu_i$, with density $f_i = dP_i/d\nu_i$, and let $Y_i$ be a $D_i$-valued random variable with distribution $P_i$. Let $P$ be the joint distribution of $(Y_1, Y_2)$ in the product space $(D_1 \times D_2, \mathcal{D}_1 \otimes \mathcal{D}_2, \nu)$, where $\mathcal{D}_1 \otimes \mathcal{D}_2$ is the product sigma algebra and $\nu = \nu_1 \times \nu_2$. Assume $P \ll \nu$ with density $f = dP/d\nu$. By definition, $Y_1$ and $Y_2$ are independent if $P = P_1 \times P_2$, or equivalently, $f = f_\Pi$ a.e. $\nu$, with $f_\Pi = f_1 f_2$. Let $g := (f + f_\Pi)/2$, the even-mixture density of the joint and the product.


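Corollary 3.5. $\alpha(f\mid g) > \alpha(g\mid f)$ if and only if $\int f^3 > \int f_\Pi^2 f$. $\alpha(f_\Pi\mid g) > \alpha(g\mid f_\Pi)$ if and only if $\int f_\Pi^3 > \int f^2 f_\Pi$. $f = f_\Pi$ a.e. $\nu$ if and only if $\alpha(f) = \alpha(f_\Pi)$ and $\alpha(f\mid g) = \alpha(g\mid f) = \alpha(f_\Pi\mid g) = \alpha(g\mid f_\Pi)$.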

A  general  measure  of  dependence  can  be  con¬ 
structed  by  combining  the  above  RACs. 


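$\Delta(P_1, P_2) :=$ [expression garbled in the source; its recoverable terms are $\alpha(f_\Pi)$, $\alpha(f_\Pi\mid g)$, $\alpha(g\mid f_\Pi)$ and $\alpha(f\mid g)$]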

Note $\Delta(P_1, P_2) = 0$ if and only if the two distributions are independent; the greater $\Delta(P_1, P_2)$, the stronger the dependence between the two distributions. This RAC measure of dependence reflects the deviation from independence by detecting the discrepancy in the concentration of the joint and the product densities in the sample space. It has the advantage of being extremely general, and the drawback of not reflecting the detailed nature of the dependence. See Johnson et al. (1994) for a related measure of dependence calibrated against the correlation coefficient of bivariate normal distributions.


4.  ESTIMATION  OF  RAC 

Estimation of the RAC $\alpha(f\mid g)$ by quadrat sampling is briefly discussed here in general terms. Theorem 2.1 suggests the following quadrat count moment estimator of $\alpha(f\mid g)$ under the condition (2.1) with $h_f = f$ in a proximity space $(D, d)$.

(4.1)

where $N$ is the total number of observed individuals, $r > 0$ is a number close to 0 in condition (2.1), and $s_r^2$ and $m_r$ are respectively the sample variance and sample mean of the quadrat counts from a sample of size-$r$ quadrats taken from the design distribution.


Let $n$ be the number of random quadrats. The consistency of moment estimators implies that $\hat\alpha(f\mid g)$ converges in probability as $\min(N, n) \uparrow \infty$ to the function $a(\lambda_r, \mu_r) = 1 + (\lambda_r/\mu_r) - (1/\mu_r)$, with $\mu_r$ and $\lambda_r$ the quadrat count mean and variance-to-mean ratio respectively. Under condition (2.1) with $h_f = f$, $a(\lambda_r, \mu_r) = \alpha(f\mid g) + o(1)$, $r \downarrow 0$, $N \uparrow \infty$, so $\hat\alpha(f\mid g)$ consistently estimates an approximation of the RAC.
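The limit function $a(\lambda_r, \mu_r)$ above can be illustrated with a small plug-in calculation. The following Python sketch simply substitutes the sample mean and the sample variance-to-mean ratio of observed quadrat counts into $1 + (\lambda/\mu) - (1/\mu)$. It is an assumption-laden illustration, not a reproduction of estimator (4.1), whose exact form (including any dependence on $N$) is not legible here; the function name `rac_plugin_estimate` is hypothetical.

```python
from statistics import mean, variance


def rac_plugin_estimate(counts):
    """Plug-in value of a(lambda, mu) = 1 + lambda/mu - 1/mu.

    counts: observed quadrat counts (one integer per sampled quadrat).
    This is a sketch of the limit function in the text, not estimator (4.1).
    """
    m = mean(counts)                 # sample mean of the quadrat counts (mu)
    if m <= 0:
        raise ValueError("at least one quadrat must contain an individual")
    lam = variance(counts) / m       # sample variance-to-mean ratio (lambda)
    return 1 + lam / m - 1 / m


# Counts with mean 2 and sample variance 4 give 1 + 2/2 - 1/2.
print(rac_plugin_estimate([0, 2, 4]))
```

Counts whose variance exceeds their mean push the value above 1, in line with the reading of the RAC as a measure of density contrast.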
Efficient Computation of Statistical Procedures based on Subsetting the Observations

Abstract

Many statistical techniques require that computations be done on all subsets of size r in a data set of size n. Typically, this is done lexicographically, i.e., with nested do-loops. If a one-point exchange update formula is available, then it is used on the inner loop. In this paper we discuss a method of counting through all subsets of size r in a data set of size n by changing only one element as one goes from one subset to the next. The advantage of such methods is that an update formula can be used at every step, thus potentially saving computation time. The method used to compute the next subset in the list requires some computation time, and thus the new method will only be faster if the update formula is sufficiently faster than doing the computation from scratch.

1  Introduction 

Statistical procedures such as jackknife estimation, influence diagnostics, cluster analysis, and permutation tests call for computations on all size r subsets of an n element observation dataset. As a result these procedures can be very computer-intensive. Though computational speed is improving, these procedures can easily exceed available resources and therefore are not considered for use in some applications. Algorithm efficiency thus plays an extremely important part in assessing applicability. In this paper we will combine different subset generating methods with iterative updating techniques and build algorithms that minimize the number of floating point operations (FLOPs) necessary for computations. This FLOP minimization, as a result, will expand the situations in which these procedures can be used.

The remainder of this paper will be organized as follows. In section 2 we will describe subsetting procedures. In section 3 we will look at the computation of subsetting procedures based on different subset generating techniques. In section 4 we will discuss the proof of the existence of change-one subset generators and give an algorithm for one such method. In section 5 we will look at the relative efficiency of using different subset generators in general computational procedures that allow for iterative updating.

2  Subsetting  Procedures 

Computing the statistical procedures mentioned in the introduction requires sequentially generating all $\binom{n}{r}$ subsets of the n observations. If the observations are indexed by the set $\{1, \ldots, n\}$, then let all size r subsets be denoted by $S_{n,r} = \{s_1, \ldots, s_{\binom{n}{r}}\}$, where $s_k = \{i_1, \ldots, i_r\}$ and $1 < r < n$. Using this set of index subsets we could, for example, compute Delete-d Cook's Distance,

A.  (1) 

by  sequentially  counting  through  the  subsets. 

Cook's Distance measures the influence that the subset $\{(Y_i, X_i) \mid i \in s_k\}$ has on the regression model

$Y_{n \times 1} = X_{n \times p}\, \beta_{p \times 1} + \epsilon_{n \times 1}$, (2)

where $b$ is an estimate of $\beta$ and $b_{s_k}$ is an estimate of $\beta$ based on the size $r = n - d$ subset indexed by $s_k$. Clearly the computer intensive part of this procedure is in computing $b_{s_k}$ for each subset. Another example is the Delete-d Jackknife variance estimation,

$v(\hat\theta) \propto \sum_{s_k \in S_{n,n-d}} |X_{s_k}' X_{s_k}|\, (\hat\theta - \hat\theta_{s_k})^2$, (3)

given by Wu [1] for the regression model in (2). Here $\hat\theta = g(b)$ is a smooth function of the estimated regression coefficients. The computer intensive aspect is in calculating $b_{s_k}$ and $|X_{s_k}' X_{s_k}|$ for all subsets.

The examples given above can be put into a general context which we will call Subsetting Procedures. A subsetting procedure is an aggregate calculation or operation involving a basic function of each subset of the observations of a specified size. For instance, Cook's Distance calculates regression coefficients for each size $r = n - d$ subset of the observations, with the aggregate operation being a list or partial list of the influence measure of each subset. Likewise, Wu's jackknife variance estimation calculates a function of the observations for each subset, with the aggregate being the sequential sum over all the subsets.
So, based on these examples, every subsetting procedure will have a basic function or operation that is repeatedly calculated. If this basic operation is called $\theta$, then with the observations given by $X = \{X_1, \ldots, X_n\}$ and the subset by $X_{s_k} = \{X_i \mid i \in s_k\}$, a subsetting procedure will calculate $\theta_{s_k} = \theta(X_{s_k})$ for each $s_k \in S_{n,r}$. Depending on the ultimate calculation involved, computing $\theta_{s_k}$ will usually involve part of the overall computations on the current subset. In Delete-d Cook's Distance the basic operation would be calculating the regression coefficients while the overall computation for each subset is the influence measure based on those coefficients. From here on we assume that evaluating $\theta(X_{s_k})$ represents the bulk of the calculations involving the k-th subset in the subsetting procedure.


3  Computing 

Calculating a subsetting procedure is easily accomplished using the following algorithm.

    for k = 1 to $\binom{n}{r}$
        generate subset $s_k$
        $f(\theta(X_{s_k}))$
    end.                                        (4)


The term $f$ above is the aggregating operation of the subsetting procedure; this term will be suppressed in subsequent algorithms. The question now is how to generate the subsets. Usually the subsets are generated lexicographically or alphabetically. An algorithm for computing the subsetting procedure in (4), generating $S_{n,3}$ lexicographically, follows.


    k = 1
    for $i_1$ = 1 to n - 2,
        for $i_2$ = $i_1$ + 1 to n - 1,
            for $i_3$ = $i_2$ + 1 to n,
                $s_k = \{i_1, i_2, i_3\}$
                $\theta_{s_k} = \theta(X_{s_k})$
                k = k + 1
            end
        end
    end.                                        (5)


For the above algorithm, if we look at the subset generation alone, we have for n = 5,

$S_{5,3}$:
    $s_1 = \{1,2,3\}$,   $s_6 = \{1,4,5\}$,
    $s_2 = \{1,2,4\}$,   $s_7 = \{2,3,4\}$,
    $s_3 = \{1,2,5\}$,   $s_8 = \{2,3,5\}$,
    $s_4 = \{1,3,4\}$,   $s_9 = \{2,4,5\}$,
    $s_5 = \{1,3,5\}$,   $s_{10} = \{3,4,5\}$.
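As a quick check of the nested-loop scheme, the following Python sketch generates the size-3 subsets lexicographically and compares the result with the standard library's `itertools.combinations`, which also emits combinations in lexicographic order. The function name `lexicographic_subsets_3` is an illustrative choice and does not appear in the paper.

```python
from itertools import combinations


def lexicographic_subsets_3(n):
    """All size-3 subsets of {1, ..., n} in the order produced by three nested loops."""
    subsets = []
    for i1 in range(1, n - 1):               # i1 = 1 .. n-2
        for i2 in range(i1 + 1, n):          # i2 = i1+1 .. n-1
            for i3 in range(i2 + 1, n + 1):  # i3 = i2+1 .. n
                subsets.append((i1, i2, i3))
    return subsets


s53 = lexicographic_subsets_3(5)
for k, s in enumerate(s53, start=1):
    print(f"s{k} = {s}")

# The nested loops and itertools.combinations agree on both content and order.
assert s53 == list(combinations(range(1, 6), 3))
```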


The algorithm in (5) is straightforward and easy to code for any size problem, but depending on the procedure involved, dataset size and subset size this can be a formidable task! To lessen this cost, we need to improve computational efficiency. If we assume that the formula or logical structure in computing $\theta_{s_k}$ is already efficient then the only other choice we have is to use an iterative updating scheme that relies on the results or partial results of computing $\theta_{s_{k-1}}$. So, to make the above algorithm more computationally efficient suppose we can calculate $\theta_{s_k}$ by updating $\theta_{s_{k-1}}$. The following code is based on using an update inside the inner loop of the algorithm (5).

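    k = 1
    for $i_1$ = 1 to n - 2,
        for $i_2$ = $i_1$ + 1 to n - 1,
            $s_k = \{i_1, i_2, i_2 + 1\}$
            $\theta_{s_k} = \theta(X_{s_k})$
            k = k + 1
            for $i_3$ = $i_2$ + 2 to n,
                $s_k = \{i_1, i_2, i_3\}$
                $\theta_{s_k}$ = Update($\theta_{s_{k-1}}$, $X_{s_k}$, $X_{i_3}$, $X_{i_3 - 1}$)
                k = k + 1
            end
        end
    end.                                        (6)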
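To make the inner-loop update concrete, here is a small Python sketch in which the basic function is taken to be the subset sum, so that the update is a single subtraction and addition. The choice of the sum as basic function, the function name `subset_sums_ilu`, and the bookkeeping counters are assumptions made for illustration only.

```python
from math import comb


def subset_sums_ilu(x):
    """Sum of every size-3 subset of x, updating inside the inner loop.

    Returns (thetas, n_full, n_updates), where thetas maps each index
    subset (i1, i2, i3), 1-based, to the sum of its observations.
    """
    n = len(x)
    X = [None] + list(x)          # shift to 1-based indexing as in the text
    thetas = {}
    n_full = n_updates = 0
    for i1 in range(1, n - 1):
        for i2 in range(i1 + 1, n):
            # First subset for this (i1, i2): computed from scratch.
            theta = X[i1] + X[i2] + X[i2 + 1]
            n_full += 1
            thetas[(i1, i2, i2 + 1)] = theta
            for i3 in range(i2 + 2, n + 1):
                # Only the last element changes: X[i3 - 1] leaves, X[i3] enters.
                theta = theta - X[i3 - 1] + X[i3]
                n_updates += 1
                thetas[(i1, i2, i3)] = theta
    return thetas, n_full, n_updates


thetas, n_full, n_updates = subset_sums_ilu([1, 2, 4, 8, 16])
print(n_full, n_updates)          # full computations vs. updates
assert n_full == comb(4, 2) and n_full + n_updates == comb(5, 3)
```

In this sketch a from-scratch computation happens once per (i1, i2) pair, so updates replace only part of the work; the remaining from-scratch evaluations are what a change-one ordering would also turn into updates.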

Notice the difference between this code and the code given in (5). The update formula relies on the single element change made in the subset during the inner loop of the algorithm. If we suppose that change-one generated subsets allow for more efficient computation of subsetting procedures and we have a change-one type of update then we would want to use the update more often during the calculations. Code for such a procedure would have the form,

    $\theta_{s_1} = \theta(X_{s_1})$
    for k = 2 to $\binom{n}{r}$
        generate subset $s_k$ by
            changing one element in $s_{k-1}$        (7)
        $\theta_{s_k}$ = Update($\theta_{s_{k-1}}$, ...)
    end.

4  Change-One  Generator 

The use of iterative updates in scientific computing is well established. In our case, since we are iterating through a sequence of subsets, placing the update where there is a minimal change between consecutive subsets will, as we did above, give a more efficient updating scheme. Assuming our procedure allows for a change-one type updating scheme, does there exist an alternative method of generating subsets that will list all subsets by making single element changes?

Leo W. Lanthroum discovered in 1965 an algorithm that generates $S_{n,r}$ in which a single element is changed in one subset to generate the next. Chase [2] presented code for this algorithm, and subsequent methods based on modified binary reflected Gray codes were developed by Nijenhuis and Wilf [3], Bitner et al. [4], and recently Brezovec and Lee [5]. These methods will be called change-one (C1) subset generators throughout the remainder of this paper. To illustrate, $S_{5,3}$ when generated by a change-one generator becomes,

$C1[S_{5,3}]$:
    $s_1 = \{1,2,3\}$,   $s_6 = \{2,4,5\}$,
    $s_2 = \{1,3,4\}$,   $s_7 = \{3,4,5\}$,
    $s_3 = \{2,3,4\}$,   $s_8 = \{1,3,5\}$,
    $s_4 = \{1,2,4\}$,   $s_9 = \{2,3,5\}$,
    $s_5 = \{1,4,5\}$,   $s_{10} = \{1,2,5\}$.

Notice that one and only one change is made in going from one subset to the next. By using only the changing element, $C1[S_{n,r}]$ can also be described by an initial subset $s_{I,n,r}$ and the set of ordered pairs, $s^*_k = (\mathrm{out}_k, \mathrm{in}_k)$, where $\mathrm{in}_k$ is the index of the element entering the k-th combination and $\mathrm{out}_k$ is the exiting index. The resulting list of ordered pairs, call it $S^*_{n,r} = \{s^*_2, s^*_3, \ldots, s^*_{\binom{n}{r}}\}$, will be for $C1[S_{5,3}]$,

$S^*_{5,3}$:
    $s_{I,5,3} = \{1,2,3\}$,   $s^*_6 = \{1,2\}$,
    $s^*_2 = \{2,4\}$,         $s^*_7 = \{2,3\}$,
    $s^*_3 = \{1,2\}$,         $s^*_8 = \{4,1\}$,
    $s^*_4 = \{3,1\}$,         $s^*_9 = \{1,2\}$,
    $s^*_5 = \{2,5\}$,         $s^*_{10} = \{3,1\}$.

The method for constructing and proving the existence of change-one subset generators is given in Nijenhuis and Wilf ([3], 1975). Their method can be described, easily, by letting $S_{n,r} = \{s_1, s_2, \ldots, s_{\binom{n}{r}}\}$ be a change-one generated subset list with $s_1 = \{1, 2, \ldots, r\}$ and $s_{\binom{n}{r}} = \{1, 2, \ldots, r-1, n\}$. Then if $\bar S_{n,r}$ represents $S_{n,r}$ but in reverse order, we have

$S_{n,r} = \{\, S_{n-1,r};\ \ S_{n-2,r-2} \cup \{n-1, n\};\ \ \bar S_{n-2,r-1} \cup \{n\} \,\}$. (8)

Thus for any n and r, where 0 < r < n, it follows that $S_{n,r}$ exists. A point of interest is that this method will build a subset list regardless of how the minor subset lists $S_{n-1,r}$, $S_{n-2,r-2}$, and $S_{n-2,r-1}$ were generated. For our case, assuming the minor subsets are change-one generated, the result is a change-one list. The proof is by induction.
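The three-block construction can be sketched recursively in Python: the subsets without n come first, then those containing both n - 1 and n, then (in reversed order) those containing n but not n - 1. The base cases for r = 0, r = n and out-of-range r are assumptions added to make the recursion terminate, and the function name `change_one_subsets` is illustrative.

```python
from math import comb


def change_one_subsets(n, r):
    """Size-r subsets of {1..n}, built block by block so that neighbours differ by one element."""
    if r < 0 or r > n:
        return []                                  # no subsets of this size
    if r == 0:
        return [()]                                # only the empty subset
    if r == n:
        return [tuple(range(1, n + 1))]            # only the full set
    first = change_one_subsets(n - 1, r)                                  # subsets without n
    middle = [s + (n - 1, n) for s in change_one_subsets(n - 2, r - 2)]   # contain n-1 and n
    last = [s + (n,) for s in reversed(change_one_subsets(n - 2, r - 1))]  # contain n, not n-1
    return first + middle + last


listing = change_one_subsets(5, 3)
print(listing)

# Every consecutive pair differs by exactly one element leaving and one entering.
for a, b in zip(listing, listing[1:]):
    assert len(set(a) ^ set(b)) == 2
assert len(listing) == comb(5, 3)
```

For n = 5 and r = 3 this sketch yields the same ten subsets, in the same order, as the change-one listing of $S_{5,3}$ shown earlier.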

An  algorithm  that  actually  generates  a  change- 
one  subset  list  is  more  complicated  than  the  descrip¬ 
tion  given  in  (8).  The  algorithm  described  below  is 
by  Brezovec  and  Lee[5]. 

4.1 A Change-One Subset Generating Algorithm

One method of generating a change-one subset list $S_{n,r}$ is to let $s_1 = \{1, \ldots, r\}$ and then go from $s_k = \{i_1, \ldots, i_r\}$ to $s_{k+1}$, by first determining

$q = \min\{j : d_j > 0\}$ (9)

where

    $d_j = n - i_r$                if $j = r$,
    $d_j = i_{j+1} - i_j - 1$      if $j < r$ and $r - j$ is even,
    $d_j = i_j - j$                if $r - j$ is odd.

Then setting $s_{k+1} = \{i'_1, \ldots, i'_r\}$ where

    $i'_j = i_j$                   if $j \in \{q+1, \ldots, r\}$,
    $i'_j = i_j + (-1)^{r-j}$      if $j = q$,
    $i'_j = i_q$                   if $j = q - 1$ and $r - j$ is odd,
    $i'_j = j$                     otherwise.

If no $j$ satisfies (9) then $s_k$ is the last member of the list. In addition, we can generate the list $S^*_{n,r}$ by using the subset $s_k$ and the number $q$ found in (9). Let $s_I = \{1, \ldots, r\}$ and $s^*_2 = \{r - 1, r + 1\}$, now setting $s^*_{k+1} = \{\mathrm{out}, \mathrm{in}\}$, where

    in  = $q - 1$                  if $q > 1$ and $r - q + 1$ is even,
    in  = $i_q + (-1)^{r-q}$       otherwise,

    out = $i_{q-1}$                if $q > 1$ and $r - q + 1$ is odd,
    out = $i_q$                    otherwise.
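The successor rule is easier to follow in executable form. The Python sketch below encodes one reading of the rule, reconstructed so that it reproduces the $C1[S_{5,3}]$ listing given earlier: find the smallest position q with a positive gap d, move the element at position q by one, reset the positions below it, and stop when no position has a positive gap. The exact case distinctions are an assumption (the printed rule is only partly legible), and the names `next_subset` and `change_one_list` are hypothetical.

```python
def next_subset(s, n):
    """Successor of the sorted subset s of {1..n}, or None if s is the last one."""
    r = len(s)
    i = [0] + list(s)                      # 1-based positions i[1..r]

    def d(j):
        if j == r:
            return n - i[r]
        if (r - j) % 2 == 0:
            return i[j + 1] - i[j] - 1
        return i[j] - j

    q = next((j for j in range(1, r + 1) if d(j) > 0), None)
    if q is None:
        return None                        # no position can move: end of list
    new = i[:]
    new[q] = i[q] + (-1) ** (r - q)        # step up if r-q is even, down if odd
    for j in range(1, q):
        if j == q - 1 and (r - j) % 2 == 1:
            new[j] = i[q]                  # take over the old value at position q
        else:
            new[j] = j                     # reset to the smallest possible value
    return tuple(new[1:])


def change_one_list(n, r):
    """Full list obtained by iterating next_subset from {1, ..., r}."""
    s, out = tuple(range(1, r + 1)), []
    while s is not None:
        out.append(s)
        s = next_subset(s, n)
    return out


print(change_one_list(5, 3))
```

Under this reading, the (out, in) pair for each step can be recovered as the set difference between consecutive subsets, which is an alternative to evaluating the closed-form expressions for in and out.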

5  Relative  Efficiency 

Our goal is to make subsetting procedures more efficient by reducing the computational cost. The specific cost can be execution time, number of floating point operations or both combined. Here we will use the number of floating point operations (FLOPs) to measure the cost, with the standard convention of counting each addition as one FLOP and each multiplication as one FLOP.

To improve the efficiency of a subsetting procedure we will concentrate on the basic function $\theta_{s_k}$ and the subset generator, which will be either change-one or lexicographic. Suppose it takes at most $F$ FLOPs to compute $\theta_{s_k}$, where $F$ is a function of the size of the subset dataset. The total FLOPs for computing $\theta_{s_k}$ for all $s_k \in S_{n,r}$ will be at most $\binom{n}{r} \cdot F$. This would be the amount needed to compute a general algorithm like (5); call this method straight-forward-lexicographic (SFL). The number of FLOPs needed to compute a procedure based on the inner-loop update (ILU) algorithm (6) is at most $\binom{n-1}{r-1} \cdot F + [\binom{n}{r} - \binom{n-1}{r-1}] \cdot \gamma \cdot F$, where $\gamma \cdot F$ is the number of FLOPs needed to compute the update function and $0 < \gamma < 1$. Here we make the obvious assumption that the update is less costly to compute than the basic function. Additionally, if we base our computations on the change-one algorithm (7), updating each time, then the FLOPs needed will be $F + [\binom{n}{r} - 1] \cdot \gamma \cdot F + a \cdot \binom{n}{r}$, where $a$ is the maximum FLOPs needed to generate an index subset. Using the above naive FLOP counts to evaluate the efficiency of computing a general subsetting procedure based on the three algorithms presented in section 3, we have the following relative efficiencies.

$\mathrm{rel}(\mathrm{ILU}, \mathrm{SFL}) = \frac{r}{n} + \gamma\,\frac{n-r}{n}$, (10)

$\mathrm{rel}(\mathrm{C1}, \mathrm{SFL}) = \gamma + \frac{a}{F}$, (11)

$\mathrm{rel}(\mathrm{C1}, \mathrm{ILU}) = \frac{n\,(\gamma F + a)}{F\,[r + \gamma (n-r)]}$. (12)

Our use of relative efficiency (rel) implies that A is more efficient than B if rel(A, B) < 1. Using this definition of relative efficiency we see that a subsetting procedure utilizing a C1 type algorithm will be more efficient than a SFL algorithm if $a < F \cdot (1 - \gamma)$, or generating a subset costs less than the saved cost in updating the basic function. Moreover, a C1 algorithm is preferred to an ILU algorithm when

$a < F\,(1 - \gamma)\,\frac{r}{n}$, (13)

or the cost of computing a change-one subset is a fraction of the saved cost in updating the basic function.
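The FLOP bounds can be compared numerically. The Python sketch below evaluates the three counts for given n, r, F, $\gamma$ and a, and checks condition (13) against the direct comparison of the C1 and ILU counts. The function names and the sample values (n = 10, r = 3, F = 100, $\gamma$ = 0.2, a = 10) are illustrative assumptions, not figures from the paper.

```python
from math import comb


def flops_sfl(n, r, F):
    """Straight-forward-lexicographic: every subset computed from scratch."""
    return comb(n, r) * F


def flops_ilu(n, r, F, gamma):
    """Inner-loop update: comb(n-1, r-1) full computations, updates elsewhere."""
    full = comb(n - 1, r - 1)
    return full * F + (comb(n, r) - full) * gamma * F


def flops_c1(n, r, F, gamma, a):
    """Change-one: one full computation, then an update plus subset generation cost a."""
    total = comb(n, r)
    return F + (total - 1) * gamma * F + a * total


n, r, F, gamma, a = 10, 3, 100, 0.2, 10
sfl, ilu, c1 = flops_sfl(n, r, F), flops_ilu(n, r, F, gamma), flops_c1(n, r, F, gamma, a)
print("rel(ILU, SFL) =", ilu / sfl)
print("rel(C1, SFL)  =", c1 / sfl)
print("rel(C1, ILU)  =", c1 / ilu)
print("condition (13) holds:", a < F * (1 - gamma) * r / n)
```

The ratio of the ILU and SFL counts reproduces $r/n + \gamma (n - r)/n$ exactly, while the C1 ratios match (11) and (12) only up to the single from-scratch evaluation of the first subset, which the naive expressions ignore.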

6  Conclusion 

The above discussion relies on the existence of a stable update of the basic function. Finding a stable update may not be a trivial matter when looking for ways to improve a subsetting procedure. This is beyond the scope of this paper. We did show that a subsetting procedure can be improved if an update exists. That is nothing new, but from (13) we see that a C1 algorithm can only improve a subsetting procedure up to a limit, the limiting factor being the cost of computing a change-one subset. This is certainly intuitive and it does open the door for further research. This research would be to find a subset generator that requires a minimal amount of FLOPs, hopefully not depending on n. The algorithm we gave in section 4 requires at most n FLOPs to generate a subset. This limits the situations in which it will be useful.
A frequency domain bootstrap for time series distribution of
Abstract. The properties of a bootstrap based on studentized periodogram ordinates are investigated. We give a correction which emulates the dependence structure of the periodogram. Furthermore, we study the case of tapered data.


1. Introduction. In time series analysis there exists no canonical way to bootstrap the observed data set. The reason is that one essentially has only one multivariate observation available. In order to construct a bootstrap one therefore needs additional information (e.g. on the dependence structure of the process or on the statistic to be bootstrapped). As a consequence a variety of different bootstrap methods have been suggested which have their merits in different situations (cf. Künsch, 1989; Liu and Singh, 1988; Politis and Romano, 1992; Freedman, 1984; Kreiss and Franke, 1989; Hurvich and Zeger, 1988). In this paper we study a frequency bootstrap based on studentized periodogram ordinates. Although the periodogram ordinates at different frequencies are asymptotically independent (which is the basis for this bootstrap idea, cf. Franke and Härdle, 1992), the minor dependence sums up in certain statistics to a nonvanishing contribution. Thus, an ordinary bootstrap with an independent bootstrap sample does not lead to a valid bootstrap approximation for certain statistics. We therefore suggest in this paper a modification which leads to a dependent frequency domain bootstrap sample.

2. The method. Let $X_t$, $t \in \mathbb{Z}$, be a real-valued stationary time series with spectral density $f$ and

$I_T(\alpha) = \frac{1}{2\pi H_{2,T}} \Big| \sum_{t=1}^{T} h(t/T)\, X_t \exp(-i\alpha t) \Big|^2$

be the tapered periodogram with the data-taper $h: [0,1] \to \mathbb{R}$ and $H_{k,T} = \sum_{t=1}^{T} h(t/T)^k$. We are interested in the distribution of spectral mean estimates, i.e. in the distribution of

(2.1) $\sqrt{T}\,\big(A(\phi, I_T) - A(\phi, f)\big)$

where

$A(\phi, f) = \int_0^\pi \phi(\alpha) f(\alpha)\, d\alpha$.

This paper will appear in the Proceedings on the Interface

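Examples are estimates for the covariance function ($\phi(\alpha) = 2 \cos \alpha u$) and the spectral measure ($\phi(\alpha) = 1_{[0,\lambda]}(\alpha)$). The asymptotic distribution of (2.1) is well known. It is under suitable regularity conditions a Gaussian distribution with mean zero and variance

(2.2) $v = c_h \Big[ 2\pi \int_0^\pi \phi^2(\alpha) f^2(\alpha)\, d\alpha + (\kappa_4/\sigma^4) \Big( \int_0^\pi \phi(\alpha) f(\alpha)\, d\alpha \Big)^2 \Big]$, where $c_h = \|h\|_4^4 / \|h\|_2^4$

(cf. Dahlhaus, 1983). Here it is assumed that $X_t$ is a linear process with innovation sequence $\varepsilon_t$ where $\mathrm{var}(\varepsilon_t) = \sigma^2$ and $\mathrm{cum}_4(\varepsilon_t) = \kappa_4$.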

We now approximate the distribution of (2.1) by a bootstrap in the frequency domain. The basic idea results from the fact that $I_T(\alpha)/f(\alpha)$ are for different $\alpha \not\equiv 0 \bmod \pi$ asymptotically independent. This suggests the following bootstrap procedure. Let $n = [T/2]$ and $I_j = I_T(2\pi j/T)$.

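(2.3) Bootstrap procedure

(a) Obtain the sample of periodogram ordinates $\{I_j\}$ for $j = 1, \ldots, n$.

(b) Obtain an estimate $\hat f$ of the spectral density (e.g. a kernel estimate). Let $\{\hat f_j\} = \{\hat f(2\pi j/T)\}$.

(c) Calculate the studentized periodogram ordinates $\{\hat\varepsilon_j\} = \{I_j/\hat f_j\}$.

(d) Rescale $\hat\varepsilon_j$ and consider $\{\tilde\varepsilon_j\} = \{\hat\varepsilon_j/\hat\varepsilon_\cdot\}$ where $\hat\varepsilon_\cdot = \frac{1}{n} \sum_{j=1}^{n} \hat\varepsilon_j$.

(e) Draw independent bootstrap replicates $\{\varepsilon_j^*\}$ from the empirical distribution of $\{\tilde\varepsilon_j\}$.

(f) Define the bootstrap periodogram values by $I_j^* = \hat f_j\, \varepsilon_j^*$.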
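Steps (a) to (f) can be sketched in a few lines of Python using only the standard library. The sketch assumes no data taper ($h \equiv 1$, so $H_{2,T} = T$), uses a direct $O(T^2)$ Fourier sum, and replaces the kernel estimate of step (b) by a crude moving average of the periodogram; these choices, and the function names `periodogram`, `moving_average` and `bootstrap_periodogram`, are assumptions made for illustration.

```python
import cmath
import math
import random


def periodogram(x):
    """Step (a): untapered periodogram at frequencies 2*pi*j/T, j = 1..T//2."""
    T = len(x)
    values = []
    for j in range(1, T // 2 + 1):
        lam = 2 * math.pi * j / T
        dft = sum(x_t * cmath.exp(-1j * lam * t) for t, x_t in enumerate(x, start=1))
        values.append(abs(dft) ** 2 / (2 * math.pi * T))
    return values


def moving_average(values, half_width=2):
    """Step (b), simplified: moving average standing in for a kernel estimate."""
    n = len(values)
    out = []
    for j in range(n):
        window = values[max(0, j - half_width):min(n, j + half_width + 1)]
        out.append(sum(window) / len(window))
    return out


def bootstrap_periodogram(I, f_hat, rng):
    """Steps (c)-(f): studentize, rescale to mean one, resample, multiply back."""
    eps_hat = [i / f for i, f in zip(I, f_hat)]            # (c)
    eps_bar = sum(eps_hat) / len(eps_hat)
    eps_tilde = [e / eps_bar for e in eps_hat]             # (d)
    eps_star = rng.choices(eps_tilde, k=len(eps_tilde))    # (e) i.i.d. draws
    return [f * e for f, e in zip(f_hat, eps_star)]        # (f)


rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(64)]
I = periodogram(x)
f_hat = moving_average(I)
I_star = bootstrap_periodogram(I, f_hat, rng)
print(len(I_star), min(I_star) >= 0)
```

Because the resampled ordinates are drawn independently, this sketch corresponds to the plain procedure only; it does not include any correction for the dependence between periodogram ordinates.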

The rescaling in (d) avoids an unnecessary bias at the resampling stage. We now can approximate the distribution of (2.1) by

(2.4) $\sqrt{T}\,\big(B(\phi, I_T^*) - B(\phi, \hat f)\big)$

where

$B(\phi, I) = \frac{2\pi}{T} \sum_{j=1}^{n} \phi_j I_j$, with $\phi_j = \phi(2\pi j/T)$.

If $\sup_\alpha |\hat f_T(\alpha) - f(\alpha)| \to 0$ a.s. then we can check under suitable regularity conditions that (2.4) is also asymptotically normal with mean zero and variance

(2.5) $2\pi \int_0^\pi \phi^2(\alpha) f^2(\alpha)\, d\alpha$.

A comparison with (2.2) implies that the bootstrap can only be consistent if $c_h = 1$ and $\kappa_4 = 0$.

To get an idea how to improve the above bootstrap we may look in more detail at the correlation of the periodogram at neighbouring periodogram ordinates. By some standard cumulant calculations (cf. Brillinger, 1981) we obtain for $j, k \in \{1, \ldots, n\}$ in the simplest case $h(x) \equiv 1$, $f(\alpha) = \sigma^2/(2\pi)$,

(2.6) $\mathrm{cov}(I_j, I_k) = f_j f_k \{\delta_{jk} + (\kappa_4/\sigma^4)\, T^{-1}\}$

(for arbitrary linear processes and $h(x) \equiv 1$ one can establish the same result up to a remainder whose order depends on whether $j \neq k$ or $j = k$). This implies

$\mathrm{var}\big(\sqrt{T}\,(B(\phi, I_T) - B(\phi, f))\big) = \frac{4\pi^2}{T} \sum_{j=1}^{n} \phi_j^2 f_j^2 + (\kappa_4/\sigma^4) \Big( \frac{2\pi}{T} \sum_{j=1}^{n} \phi_j f_j \Big)^2$

which tends to $v$ as in (2.2) with $c_h = 1$. Therefore, we need a bootstrap sample that fulfills the analogue to (2.6). As shown below this is fulfilled by the following.

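(2.7) Modified bootstrap procedure.

(a) - (e) as in the bootstrap procedure (2.3).

(f) Take an estimate $\hat\eta_4$ of $\eta_4 := \kappa_4/\sigma^4$ and define the bootstrap periodogram values by

$I_j^* = \hat f_j \Big( \varepsilon_j^* + \big\{ (1 + \hat\eta_4/2)^{1/2} - 1 \big\} \frac{1}{n} \sum_{k=1}^{n} (\varepsilon_k^* - 1) \Big)$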

As we show below this leads to a consistent bootstrap approximation if $c_h = 1$, i.e. if no taper is used or if the taper disappears asymptotically. We have no idea how to modify the above bootstrap for a general taper. However, in the following theorem we modify the statistic to obtain a valid approximation also in the tapered case.

Franke and Härdle (1992) have used the bootstrap (2.3)
without  data-taper  for  bandwidth  selection  of  a  kernel 
estimate.  Due  to  the  lower  rate  of  convergence  the  fourth 
order  cumulant  term  disappears  in  the  asymptotic  variance 
for  these  estimates  and  the  above  problems  therefore  do  not 
occur. 

An estimate of $\eta_4 = \kappa_4/\sigma^4$ may be obtained e.g. by fitting a high order autoregression and calculating the empirical fourth order cumulant and the empirical variance of the estimated residuals (this is a bit contrary to the idea of a purely nonparametric bootstrap). A nonparametric estimate can be constructed in the following way (cf. Grenander and Rosenblatt, 1956, chapter 6.5): If $c(k) = \mathrm{cov}(X_t, X_{t+k})$ and $d(k) = \mathrm{cov}(X_t^2, X_{t+k}^2)$ then it is easy to show

$\sum_{k=-\infty}^{\infty} d(k) = 2 \sum_{k=-\infty}^{\infty} c(k)^2 + (\kappa_4/\sigma^4) \Big( \int_{-\pi}^{\pi} f(\alpha)\, d\alpha \Big)^2$

I.e. we obtain with the spectral density $f_2$ of $X_t^2$

$\kappa_4/\sigma^4 = \Big\{ 2\pi f_2(0) - 4\pi \int_{-\pi}^{\pi} f(\alpha)^2\, d\alpha \Big\} \Big/ \Big( \int_{-\pi}^{\pi} f(\alpha)\, d\alpha \Big)^2$.

We may now obtain a consistent estimate of $\kappa_4/\sigma^4$ by estimating the expressions in this formula.

3.  The  validity  of  the  bootstrap 

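To establish the validity we need the following assumptions.

(A.1) $X_t$, $t \in \mathbb{Z}$, is a linear process, i.e.

$X_t = \sum_{n \in \mathbb{Z}} a_n\, \varepsilon_{t-n}$

with i.i.d. random variables $\varepsilon_t$ with $E\varepsilon_t = 0$, $E\varepsilon_t^2 = \sigma^2$ and $\mathrm{cum}_4(\varepsilon_t) = \kappa_4$. Furthermore, let

$\sum_n |a_n| < \infty$ and $\inf_{\alpha \in [0,\pi]} f(\alpha) > 0$.

(A.2) $\hat f$ is a uniformly strongly consistent estimate of $f$, $\hat\eta_4$ is a strongly consistent estimate of $\eta_4$.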
(A.3) $\phi: [-\pi, \pi] \to \mathbb{R}$ is of bounded variation and symmetric. Let $\phi_j = \phi(2\pi j/T)$.

(A.4) The data taper $h: \mathbb{R} \to [0,1]$ is of bounded variation with $h(x) = 0$ for $x \notin (0,1)$ and $\int_0^1 h(x)^2\, dx > 0$.

Furthermore, let $d_2(F, G) = \inf_{X \sim F,\, Y \sim G} \{E(X - Y)^2\}^{1/2}$ be the Mallows metric and $C_{hT} = T\, H_{4,T} / H_{2,T}^2$.

Theorem. Assume (A1) - (A5). Then we have for the bootstrap procedure (2.7)

$d_2\Big( \sqrt{T}\,\big(A(\phi, I_T) - A(\phi, f)\big),\ \sqrt{T\, C_{hT}}\,\big(B(\phi, I_T^*) - B(\phi, \hat f)\big) \Big) \to 0$ a.s.

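Proof. We only give a sketch. It is sufficient to prove the weak convergence of both statistics to the same limit and the convergence of the second moments. For $\sqrt{T}\,(A(\phi, I_T) - A(\phi, f))$ this follows e.g. from Dahlhaus (1983, Theorem 2). Standard time series calculations yield for the conditional expectation and the conditional variance of $\varepsilon_j^*$ given the original sample

$E^* \varepsilon_j^* = 1$

and

$\hat\pi_T := \mathrm{var}^* \varepsilon_j^* = \frac{1}{n} \sum_{j=1}^{n} (\hat\varepsilon_j/\hat\varepsilon_\cdot)^2 - 1$

almost surely. Direct calculations now show that

$\mathrm{cov}^*(I_j^*, I_k^*) = \hat\pi_T\, \hat f_j \hat f_k \{\delta_{jk} + \hat\eta_4\, T^{-1}\}$

(i.e. we have emulated the expression (2.6)). This implies

$\mathrm{var}^*\big(\sqrt{T\, C_{hT}}\,(B(\phi, I_T^*) - B(\phi, \hat f))\big) = C_{hT}\, \hat\pi_T \Big\{ \frac{4\pi^2}{T} \sum_{j=1}^{n} \phi_j^2 \hat f_j^2 + \hat\eta_4 \Big( \frac{2\pi}{T} \sum_{j=1}^{n} \phi_j \hat f_j \Big)^2 \Big\}$

which almost surely tends to $v$ as in (2.2). Since

$\sqrt{T\, C_{hT}}\,\big(B(\phi, I_T^*) - B(\phi, \hat f)\big) = \frac{2\pi \sqrt{C_{hT}}}{\sqrt{T}} \sum_{j=1}^{n} \Big\{ \phi_j \hat f_j + d_4\, \frac{1}{n} \sum_{k=1}^{n} \phi_k \hat f_k \Big\} (\varepsilon_j^* - 1)$

with $d_4 = \{1 + \hat\eta_4/2\}^{1/2} - 1$ the asymptotic normality follows from the central limit theorem for a triangular array of independent variables.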

We therefore have found a frequency domain bootstrap which also works in the non-Gaussian case. Concerning the data taper the result is not satisfying since we emulate the increase of the variance only by a constant. However, in the case of an asymptotically vanishing taper (which is a realistic assumption from a practical point of view) we may omit the factor $C_{hT}$.
Abstract

A new system is discussed which allows one to access data buried in very large complex databases far faster, literally thousands of times faster, and with more data insight than is possible using conventional relational database management systems. The system, known as TempleMVV, is based on U.S. Patent No. 5228119. It allows users to visually select records based on criteria imposed on one, two or up to ten independent variables and/or on the minimum, maximum, mean, sum or standard deviation of a dependent variable in any or all subspaces of the ten dimensional independent variable space. A multidimensional graph of the data is in view during the selection process. The independent variables may be categorical, ordinal, continuous or any mixture thereof. Data involving tens of millions of records and ten variables can be viewed in seconds on a 486 computer running Microsoft Windows or a UNIX workstation.

Introduction 

The  technology  to  collect  and  store  vast  quantities  of  data 
has  grown  rapidly  over  the  past  decade.  Satellite  surveys, 
credit  card  reports  and  supermarket  scanner  records  are  all 
testaments to the perceived importance of collecting information.
Unfortunately the techniques available to analyze and
utilize  these  tremendous  data  warehouses  are  limited. 

The  multivariate  problem  in  general  has  many  difficulties, 
but  these  problems  are  compounded  by  the  quantity  of  data 
involved.  Slow  access  times  make  even  routine  tasks  tedious. 
Interactively  exploring  the  dataset  is  nearly  impossible  with 
conventional  methods. 

In previous papers we have described a novel method
for analyzing multivariate data, called MultiVariate Visualization
(MVV). In this paper we apply MVV to the problem of accessing
the information in very large databases.

A  Discrete  Approach 

MVV treats the three types of variables (categorical,
ordinal and continuous) on an equal footing. Continuous
variables are binned into intervals, ordinal variables may be
grouped, and categorical variables are assigned an order. This
converts any type of variable into a sequence of “bins”. Various
binnings may be chosen to change the available resolution,
and different orderings for categorical variable values
may be better suited for different analyses.
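
The discretization step can be sketched as follows. This is only an illustration, not the TempleMVV implementation: the function names, the bin edges and the category ordering are all assumptions chosen for the example.

```python
from bisect import bisect_right

def bin_continuous(value, edges):
    """Return the bin index of value given ascending interior bin edges."""
    return bisect_right(edges, value)

def bin_categorical(value, order):
    """Return the bin index of a category under a chosen ordering."""
    return order.index(value)

# Hypothetical example: house prices cut at 100k and 200k give three bins.
price_edges = [100_000, 200_000]
location_order = ["urban", "suburban", "rural"]

record = {"price": 150_000, "location": "rural"}
cell = (bin_continuous(record["price"], price_edges),
        bin_categorical(record["location"], location_order))
print(cell)  # (1, 2)
```

Changing `price_edges` changes the resolution, and permuting `location_order` changes the ordering of the categorical bins, which are the two freedoms the text describes.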

The  Multivariate  Summary  Tree 

By binning the n variables, the n-dimensional space is
divided into N = b_1 * b_2 * b_3 * ... * b_n segments called primitive
cells, where b_i is the number of bins for the ith variable. Each
primitive cell corresponds to a unique specification of bin values
for the variables.

A “multivariate summary” is created by tabulating the following
five statistics for each primitive cell. The number of
records in the cell is counted, and for one or more dependent
variables the sum and the sum of the squares, as well as the
minimum and the maximum, are calculated.
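
A minimal sketch of this tabulation for a single dependent variable is given below. The dictionary layout and the function names are assumptions made for illustration; the source does not describe how the summary is stored.

```python
import math

def empty_summary():
    """Five statistics for a cell: count, sum, sum of squares, min, max."""
    return {"n": 0, "sum": 0.0, "sumsq": 0.0,
            "min": math.inf, "max": -math.inf}

def add_record(summary, y):
    """Fold one dependent-variable value y into a cell summary."""
    summary["n"] += 1
    summary["sum"] += y
    summary["sumsq"] += y * y
    summary["min"] = min(summary["min"], y)
    summary["max"] = max(summary["max"], y)

def summarize(records):
    """records: iterable of (cell, y); cell is a tuple of bin indices."""
    cells = {}
    for cell, y in records:
        add_record(cells.setdefault(cell, empty_summary()), y)
    return cells

cells = summarize([((0, 1), 2.0), ((0, 1), 4.0), ((1, 0), 3.0)])
print(cells[(0, 1)])
# {'n': 2, 'sum': 6.0, 'sumsq': 20.0, 'min': 2.0, 'max': 4.0}
```

Only occupied cells are stored in this sketch, so a single pass over the records suffices.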

After specifying an order for the variables, a tree structure
is created. The primitive cells are associated with the leaves,
or the bottom level, of the tree. The nodes for the next level
of the tree combine the values for the first, or fastest running,
variable, creating statistics which correspond to specific bin
values for the other n-1 variables. The next level excludes the
second variable, and so on. A total of n+1 levels (including
the root) are generated like this, with each level showing progressively
less detail. The root of the tree contains the five statistics
for the entire data set.

Figure 1 - The tree structure for three independent
variables A, B, C with 2, 3 and 3 bins respectively. Here A
is called the “fastest” variable and C is the slowest.

Figure 2 - Number of People by Gender, Age and Education. The
number of people N with a given level of education E in a specific age bracket A of a
particular gender G is represented by the heights of the narrow black vertical bars. The number N for given
A and G irrespective of E is represented by the wider white bars. Finally, N for each G irrespective of E and
A is represented by the two gray bars. Each bar type has its own scale.

The resulting data structure is much more easily manipulated
than the original dataset (its size depends only on the
number  of  variables  and  their  resolution — not  on  the  size  of 
the  original  dataset),  yet  it  retains  most  of  the  multivariate 
information  of  the  original  data.  If  reference  to  the  original 
records  is  required,  this  may  be  included  for  the  relatively 
small  overhead  of  a  single  integer  per  record. 

Figure 1 shows the nodes of an MVV tree for the simple
case of three variables A, B and C with 2, 3 and 3 bins respectively.
Variable A is called the “fastest running” variable
because it cycles through its values A1 and A2 (at the bottom
of the tree) faster than does variable B or C. B is the second
fastest and C the slowest. Nodes at the bottom of the tree contain
the five statistics described above for all records in each
of the 2*3*3=18 “primitive cells” (P.C.). Each primitive cell
corresponds to spanning just one bin for each of the three
variables A, B and C. The first level of the tree above the bottom
level consists of nodes which contain the five statistics
for all records in the “first hierarchical cells” (1st H.C.).
These cells span just one bin for variable B and one for variable
C but span all bins for the fastest variable A (here just A1
and A2). Similarly the next tree level up consists of nodes
which contain the statistics for “second hierarchical cells”
(2nd H.C.) which span all A and B bins but only one C bin.
Finally the top level of the tree has just one node which contains
the five statistics for the entire dataset, i.e. the “root cell”
which spans all A, B and C bins.
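
The level-by-level construction for this three-variable case can be sketched as below. The tuple layout (n, sum, sumsq, min, max), the key convention (fastest variable first) and the function names are assumptions for illustration, not the patented data structure.

```python
from itertools import product

def merge(a, b):
    """Combine two (n, sum, sumsq, min, max) tuples."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2],
            min(a[3], b[3]), max(a[4], b[4]))

def build_tree(leaves, n_vars):
    """leaves maps bin tuples (fastest variable first) to statistics.
    Returns n_vars + 1 levels; each level drops the fastest remaining
    variable from the keys, and the last level is the root with key ()."""
    levels = [dict(leaves)]
    for _ in range(n_vars):
        parent = {}
        for key, stats in levels[-1].items():
            up = key[1:]  # drop the fastest remaining variable
            parent[up] = merge(parent[up], stats) if up in parent else stats
        levels.append(parent)
    return levels

# Variables A, B, C with 2, 3 and 3 bins; one record of value 1.0 per cell.
leaves = {cell: (1, 1.0, 1.0, 1.0, 1.0)
          for cell in product(range(2), range(3), range(3))}
levels = build_tree(leaves, 3)
print([len(level) for level in levels])  # [18, 9, 3, 1]
```

The node counts 18, 9, 3 and 1 correspond to the primitive cells, the first and second hierarchical cells, and the root cell of Figure 1.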

These  five  statistics  allow  one  to  recursively  calculate  the 
number  of  records  as  well  as  the  minimum,  maximum,  mean, 
standard  deviation,  standard  deviation  of  the  mean  and  sum 
for  any  variable  chosen  as  the  dependent  variable  of  interest 
at  all  node  levels  of  the  tree.  One  or  more  of  these  statistics 
can  then  be  used  to  drive  attributes  of  symbols  such  as  their 
size,  location  or  color. 
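
The derived statistics follow from the stored ones because counts, sums and sums of squares add across child nodes. A minimal sketch is given below; the use of the sample (n-1) variance is an assumption, since the text does not say which estimator is used, and the function name is hypothetical.

```python
import math

def derived(n, total, sumsq):
    """Mean, sample standard deviation and standard deviation of the mean
    from a node's count, sum and sum of squares (requires n >= 2)."""
    mean = total / n
    # Guard against a tiny negative value caused by rounding.
    var = max((sumsq - n * mean * mean) / (n - 1), 0.0)
    sd = math.sqrt(var)
    return mean, sd, sd / math.sqrt(n)

# Values 2, 4 and 6 give n = 3, sum = 12 and sum of squares = 56.
mean, sd, sem = derived(3, 12.0, 56.0)
print(mean, sd)  # 4.0 2.0
```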

A  Nested,  Hierarchical  Display 

Figure  2  shows  census  data.  Here  N,  the  number  of 
records,  corresponds  to  the  number  of  people  and  is  broken 
down by gender, age and education. These variables have 2, 7
and 7 bins respectively, with education the fastest running
variable and gender the slowest. The sum statistic (of number
of  people)  is  used  to  drive  the  symbol  attribute,  which  in  the 
case  of  Figure  2  is  just  the  height  of  vertical  bars.  Each  of  the 
three  bar  types  has  its  own  scale. 

The black bars correspond to the primitive cells where
education, age and gender have each been specified. The
white bars represent the number of people for a specific age
and gender irrespective of education, i.e. education has been
summed over. These are the “first hierarchical symbols” corresponding
to the statistics (here the sum) for the first hierarchical
cells. Finally, the two gray bars represent the number
of people by gender alone. These are the second hierarchical
symbols.

Also shown in Figure 2 are a set of “IV Widgets” and a set
of “DV Widgets”. The bins for the independent variables are
represented  symbolically  by  the  IV  widgets.  The  DV  widget 
bins  represent  the  range  of  values  for  the  chosen  dependent 
statistic  (sum,  mean,  etc.). 

Visually  Guided  Data  Access 

For  a  single  variable,  a  histogram  is  a  valuable  aid  for 
determining  which  intervals  of  the  variable  are  of  interest. 
For  two  variables,  there  are  many  distributions  which  cannot 
be deduced from their marginal histograms alone. For more
than  two  variables,  the  possibilities  for  complex  behavior 
increase.  A  truly  multivariate  graph  can  help  one  specify 
ranges  of  interest,  pick  out  important  features  or  determine 
overall  trends. 

For example, in Figure 3 the size of the circles is proportional
to the number of houses. We can clearly see a strong
correlation between price and size for houses; moreover, we
can see how this correlation shifts with location.

In  previous  papers,  we  have  discussed  the  utility  of  the 
MVV technique for performing statistical analyses. Here we
focus  on  using  the  graphics  to  intelligently  access  specific 
portions  of  a  large  database. 

IV  Range  Restriction 

The  simplest  way  to  choose  a  subset  of  records  is  to 
restrict  one  of  the  variables  to  a  portion  of  its  range.  We  can 
represent  this  symbolically  by  coloring  just  the  selected  bins 
on  the  IV  widget.  This  selects  just  those  records  whose  value 
for the chosen variable is within the specified range. The ability
to see the distribution is obviously an advantage in making
these  selections. 

Consider Figure 3. The slowest running variable is location
(urban, suburban, rural). Selecting a single bin for this variable
corresponds to picking all primitive cells in one of the
three rectangular subgraphs. Restricting a faster running variable
gives a differently shaped subset of primitive cells. Figure
4 shows some examples for the case of four variables.
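
Range restriction amounts to filtering primitive cells by their bin indices. The sketch below reproduces the counts quoted for the four-variable case of Figure 4 (three bins per variable); the function name and the dictionary-based interface are assumptions for illustration.

```python
from itertools import product

def select_cells(bins_per_var, restrictions):
    """Return the primitive cells (tuples of bin indices) that satisfy
    every restriction. restrictions maps a variable's position to the
    set of bins allowed for it; unrestricted variables keep all bins."""
    cells = product(*(range(b) for b in bins_per_var))
    return [c for c in cells
            if all(c[i] in allowed for i, allowed in restrictions.items())]

bins = (3, 3, 3, 3)                       # variables A, B, C, D
only_a = select_cells(bins, {0: {1}})
a_and_b = select_cells(bins, {0: {1}, 1: {2}})
print(len(only_a), len(a_and_b))          # 27 9
```

Restricting two variables selects the intersection of the two single-variable selections, which is why the count drops from 27 to 9.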

Since including many variables can quickly lead to complicated
graphs, an alternative is to restrict a variable which is
not shown as an independent variable on the MVV graph. By
adjusting the restriction, the displayed graph will evolve to
display the dependence on the variable not shown. For example,
we could include the age of the house as a fourth variable
(represented by an IV widget) and watch how the graph of
Figure 3 changes as we focus on houses of different ages.
While this has the advantage of keeping the graphs simple,
the visual guidance for making the restriction is lost.

Figure 3 - A plot of the number of houses (indicated by
the size of circles) versus price, size and location. Price,
size and location have been binned to 10, 10 and 3 values.

DV  Contouring 

An alternative selection method uses the dependent variable.
If we think of the graph symbols as “sticking out of the
page” with different sizes corresponding to different elevations,
choosing all symbols of a certain size is like picking out
a specific elevation on a topographic contour map. This is
represented symbolically by highlighting a portion of the DV
widget. Since the different symbols represent different levels
of detail (depending on the number of variables included),
several levels of “coarse grained” or “fine grained” contouring
are available.

It is important to understand the distinction between contouring
a variable as a DV versus introducing it as an IV and
restricting its range. The latter (IV) restricts records on a case
by case basis; the former (DV) uses the properties of a group
of records. For example, the independent variable income
could be used to select only those individuals in the highest
income bracket. As a dependent variable, income could be
used to select a group (maybe a specific age group or a specific
age and education group) whose mean income has a certain
value, or whose purchasing power (sum of income) is
highest.

In  figure  5,  we  show  the  average  capital  gain  of  stocks  as 
a  function  of  four  indices.  Contouring  on  the  highest  value  of 
the  capital  gain  gives  the  groups  (as  determined  by  the  values 
of  the  four  indices)  with  the  best  average  performance. 
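
DV contouring can be sketched as a filter on a group statistic rather than on individual records. The data values, the group keys and the function name below are hypothetical and serve only to illustrate the idea.

```python
def contour(node_stat, low, high):
    """Return the groups whose dependent statistic lies in [low, high]."""
    return sorted(group for group, value in node_stat.items()
                  if low <= value <= high)

# Hypothetical mean incomes for (age bin, education bin) groups.
mean_income = {(0, 0): 21.0, (0, 1): 35.0, (1, 0): 28.0, (1, 1): 52.0}
print(contour(mean_income, 30.0, 60.0))  # [(0, 1), (1, 1)]
```

Applying the same filter to the statistics stored at a higher tree level gives the “coarse grained” contouring mentioned above, since those nodes describe larger groups.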

DV  Symbol  Selection 

This  selection  method  is  more  robust  than  the  previous 
one.  Instead  of  choosing  all  symbols  of  a  specific  size,  one 
can  choose  individual  symbols.  This  is  more  appropriate 
when  the  relevant  quantity  is  not  the  absolute  value  of  the 
dependent variable but rather its value with respect to neighbors.

Consider  figure  5.  If  we  were  just  interested  in  the  global 
maximum  (or  minimum)  we  could  use  DV  contouring  and 
select  an  extreme  of  the  range.  The  graph,  however,  can  also 
show us the behavior around the extrema. Both I and II indicate
cells which are local maxima. However, I is a maximum
which is stable with respect to changes in all four variables,
but II is very sensitive to the value of C.

Depending on the dependent variable rule or statistic chosen,
such a selection process could choose particularly volatile
stocks, or ones which are undervalued, etc.


Other  Graphical  Access  Systems 

Johnson and Shneiderman have discussed “Tree Maps”,
an alternative multivariate graphical data access technique.
Recently Tweedie et al. have proposed a “drill down” type of
data access tool in which parallel axes akin to those introduced
by Inselberg are used to display multiple (constrained
or unconstrained) marginal distributions for all
variables of interest. Unlike Inselberg, these authors deal only
with discrete variables. That is, they follow the MVV model
of binning any and all continuous variables to form discrete
ones.

Although the MVV technique, Tree Maps, and the modified
Inselberg approach of Tweedie et al. all utilize a multivariate
graphical approach to data access, they are
fundamentally different in terms of their computational
engines, the nature of their graphical presentation of multivariate
data and their scope of data analysis capabilities. The
MVV method can be used to analyze and select information
from very large databases consisting of literally tens or even
hundreds of millions of records with subsecond response.
MVV’s graphical presentation can be used to find trends and
correlations which may be important factors in record selection
and can be generalized to displaying multiple dependent
as well as independent variables.

Figure 4 - Shown in parts a, b, c and d are the primitive
cells that are selected when one constrains each variable
(A, B, C and D respectively) to one of its three bins. Here
A is the fastest and horizontal, B is the 2nd and vertical,
C is 3rd and horizontal and D is 4th and vertical. In each
case 27 primitive cells are selected. Constraining two IVs,
e.g. A as in part a and B as in part b, would select the 9
primitive cells common to a and b, i.e. an intersection.

430 Accessing Very Large Databases

Figure 5 - The average capital gains for stocks in a four-dimensional space of fundamental ratios A, B, C, D.

Conclusions 

By forcing all types of variables to be discrete, tremendous
advantages can be gained in the manipulation of large, multivariate
databases. The summary tree described above can
effectively capture most of the multivariate nature of a large
dataset. The nested, hierarchical graph not only displays multivariate
information, but can also provide an intuitive data
access system.

MVV provides one method for intelligently extracting the
information from large multivariate databases.
Abstract. This paper describes a link between a Geographical
Information System (GIS), ARC/INFO™, and an interactive
dynamic graphics program, XGobi. GISs provide
a user with standard and convenient software for spatial
geographical data. In particular, the GIS ARC/INFO is a
combination of two systems: ARC maintains the spatial information
of map features and provides tools for spatial analyses,
while INFO maintains the thematic or attribute information
associated with the map features. XGobi is an interactive
dynamic graphics program for data visualization in the
X Window System™. It is designed for the exploration of
multivariate data, primarily by manipulating and displaying
scatterplots in arbitrary dimensions.

The motivation for the work is to link the dynamic, interactive
strengths of XGobi for visualizing high-dimensional
data with the exhaustive map handling tools of ARC/INFO,
specifically to explore spatial data. This paper presents information
about the technical realization of the link between
ARC/INFO and XGobi as well as an introductory example
of its use.

1  Introduction 

Interactive and dynamic graphics for high-dimensional
data have proved useful for exploring relationships
among multiple variables. Incorporating similar
tools in the context of spatial data promises to be
a valuable aid in exploring spatial dependencies. Geographical
Information Systems (GISs) have developed
sophisticated capabilities for managing multivariate spatial
data bases but limited capabilities for conducting interactive
exploratory data analysis. The combination of
a GIS and a dynamic graphics system for multivariate
data comprises a potentially powerful tool for interactive
exploratory spatial data analysis.

In Section 2 we describe general features that should
be available for an interactive dynamic graphics tool
that operates on data available in a GIS data base.
As one particular application, the interface between the
GIS ARC/INFO™ and XGobi, an interactive dynamic
graphics program for data visualization in the X Window
System™, is described in Section 3. An example is
given in Section 4. We conclude this paper by describing
possibilities for future work.

2 Integration of Interactive and Dynamic
Graphics Tools into a GIS

The  inclusion  of  spatial  location  in  data  analyses  can 
be  addressed  in  different  ways.  Using  a  GIS,  we  maintain 
a geographic context of spatial location relative to land cover,
streams, roads, and other relevant information. It
would  also  be  feasible  to  include  the  spatial  coordinates 
as  two  (or  d)  additional  variables  into  the  analysis,  but 
this  approach  by  itself  does  not  exploit  the  considerable 
spatial  capabilities  of  GISs. 

Emphasis in GIS development has been on the input
of data, its management (storage, retrieval), and the display
of maps, graphs, and tables. GISs have some capability
to allow statistical analyses but it is generally
limited. A number of recent suggestions have been made
(e.g., Openshaw, 1991; Anselin and Getis, 1992; Ding
and Fotheringham, 1992; Fotheringham and Rogerson,
1993) to redress this imbalance. Still others have incorporated
some dynamic graphical tools into systems
that lack the full features and flexibility of a GIS (e.g.,
Haslett et al., 1991).

Our research addresses the extremely important problem
of multivariate exploratory spatial data analysis in
a GIS. GIS data structures allow the representation of
areal features (e.g., for the storage of information reported
at an aggregated spatial level, such as counties
or census tracts), linear features (e.g., for the storage
of information collected from a stream or a transportation
network), and point features. The topological data
structure of a GIS makes it possible to determine spatial
relationships between sampling locations, such as stream
sites, that would be difficult to determine otherwise. The
display capabilities of a GIS allow the spatial variables to
be overlaid on a background of hydrography, transportation,
population, land use, or other information relevant
to the attributes¹ being considered. For example, in Figure
1 the map view shows sampling sites along streams
in Erath County, Texas. Information about the topography
or land use near a sample site can give valuable
insights into the values of attributes (e.g., ammonia concentration)
collected at the site.

Figure 1: ARC/INFO control panel and example map view linked to two XGobi views.

A GIS is intrinsically multivariate and yet this is ignored
by the largely univariate statistical analyses currently
available. By building an interface between a GIS
and software for dynamic graphics, we will also provide
a platform for developing new spatial graphical methods
for spatial data sets available in the GIS (e.g., Cook et
al., 1994).

¹In the context of GISs the expression attribute is used instead
of the statistical expression variable.


3  The  ARC/INFO  to  XGobi  Interface 

Our efforts have focused on interfacing the GIS software
ARC/INFO with XGobi (Swayne et al., 1991).
ARC/INFO has been chosen because it is one of the most
frequently used GIS systems and because it is extensible
through its macro language, allowing the development
of menus and programs to carry out ARC/INFO tasks.
XGobi provides interactive and dynamic graphical tools
in the X Window System environment for exploring multivariate
data through the manipulation of scatterplots.
ARC/INFO is used to maintain the GIS data base and
to display the geography, while XGobi is used primarily
to explore the relationships within and between the
attributes. Figure 2 shows how the communication between
these two programs is established.

Figure 2: Interface linking ARC/INFO with XGobi.


3.1  The  ARC/INFO  Part 

ARC/INFO is used to display the location of sampling
sites in a graphics window. The sampling sites can
be displayed on a background of roads, streams, or any
other relevant geographic data sets available. A control
panel, the upper right window shown in Figure 1, allows
the user to brush or subset the sampling sites interactively.
The term “brush” refers to changing the symbol
used to represent the specified points and “subset”
refers to choosing a subset of the points for further analysis,
disregarding (temporarily) the other points. The
brushed (or subsetted) sites are redrawn with the specified
glyph, size, and color, and the ARC/INFO data
base is modified. These changes are detected and passed
to XGobi by the intermediate process, as described in
subsequent sections.

The ARC/INFO portion of the application is implemented
with AML (Arc Macro Language) and works as
follows. An ARC/INFO data set consists of a set of
spatial features, in this case points, each of which has a
record in a data base table. When the application starts,
a column in the table is initialized with a default value
which represents the symbol, i.e., glyph, size, and color,
with which to draw each point. As the user interactively
queries the points, the values in this column are updated
to reflect the user’s actions. The changes to this column
are detected by the intermediate ARC/XGobi server process
and sent to XGobi.

Pseudo  code  for  the  ARC/INFO  part  is  given  below. 
In  this  pseudo  code,  the  control  panel  is  represented  by 
the  repeat  loop. 


set current symbol to default symbol
repeat
    wait for user action
    case user action {
        when "identify ARC/INFO data set"
            initialize the symbol column to current symbol
        when "brush"
            spatially select points to brush
            set symbol column of selected points to current symbol
        when "subset"
            spatially select points to subset
            set symbol column of selected points to current symbol
            set symbol column of other points to 0
        when "clear selection"
            reset the symbol column of all points to default symbol
        when "change color"
            reset current symbol to reflect changed color
        when "change glyph"
            reset current symbol to reflect changed glyph
        when "change size"
            reset current symbol to reflect changed size
    }
until (forever)
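
The control-panel logic above can be mirrored in a few lines of Python. This is a sketch only: the real application is written in AML, and the symbol codes, the single "set symbol" action (standing in for the three change color, glyph and size actions) and the function name are assumptions for illustration.

```python
DEFAULT_SYMBOL = 1   # hypothetical code for the default glyph, size and color
HIDDEN = 0           # the pseudo code sets points outside a subset to 0

def apply_action(symbols, current, action, arg=None):
    """Apply one control-panel action to the symbol column.

    symbols maps point id -> symbol code and is updated in place.
    Returns the current symbol, which only "set symbol" changes."""
    if action == "set symbol":          # change color, glyph or size
        return arg
    if action == "identify":
        targets, value = symbols, current
    elif action == "brush":
        targets, value = arg, current
    elif action == "clear":
        targets, value = symbols, DEFAULT_SYMBOL
    elif action == "subset":
        for point in symbols:
            symbols[point] = current if point in arg else HIDDEN
        return current
    else:
        raise ValueError(f"unknown action: {action}")
    for point in list(targets):
        symbols[point] = value
    return current

symbols = {4: 1, 7: 1, 12: 1}
current = apply_action(symbols, DEFAULT_SYMBOL, "set symbol", 5)
current = apply_action(symbols, current, "brush", {4, 12})
print(symbols)  # {4: 5, 7: 1, 12: 5}
```

Because every action only rewrites one column, an observer such as the intermediate process needs to watch just that column to detect changes.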

3.2  The  Intermediate  Process 

The intermediate process, denoted as ARC/XGobi Interface
in Figure 2, has to serve the requests of the XGobi
clients by reading information from the ARC/INFO data
base. The interprocess communication between this
server and the XGobi clients is based on Stevens’ (1990)
concurrent server example, and uses a Transmission Control
Protocol (TCP) socket, i.e., an Internet stream
socket. Upon receiving a connection request from an
XGobi client, the intermediate process forks an identical
child process. Each child process communicates with
one XGobi client; thus, one-to-one connections between
server processes and XGobi clients are established.

Obviously, the forking of child processes is a heavyweight
mechanism to provide a concurrent server. However,
we assume that this mechanism is available for all
hardware environments that support ARC/INFO. An alternative
for some workstations (e.g., DEC™) is the
use of multiple threads, which are lightweight processes,
but this approach is not available on all systems (e.g.,
Sun™/Sparc™ workstations).

The main task of the child processes is the following:
If the XGobi client indicates that it wants the currently
selected ARC/INFO data set and future updates
of this selection, the related ARC/XGobi server (child)
has to check continually whether the ARC/INFO data
base has been changed. If so, the modifications, such as
new brushed or subsetted points, are immediately passed
to the corresponding XGobi clients.

A  child  process  is  terminated  by  a  QUIT  command 
of  its  XGobi  counterpart,  or  if  it  detects  the  unexpected 
termination  of  the  client  process  or  the  breakdown  of 
the  communication  channel.  The  intermediate  (parent) 
process  will  operate  until  it  is  explicitly  terminated  by 
the  user.  The  pseudo  code  for  the  ARC/XGobi  interface 
follows. 

Parent:
    init ARC/INFO defaults
    init sockets
    repeat
        accept connection from XGobi client
        fork child process
    until (forever)

Children:
    repeat
        wait for input from XGobi client or for Timeout
        if (input received = SEND Filename)
            then {send data from file Filename; Update = false}
        else if (input received = SEND current)
            then {send data from current selection; Update = true}
        else if (Timeout and Update)
            then if (current selection modified since last send)
                then send update of current selection
    until (input received = QUIT or abort of client
           or channel down)
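
The decision logic inside one iteration of the child loop can be sketched as a pure function, leaving out the sockets and the fork. The event tuples, the returned message tuples and the use of a version counter to detect a modified selection are assumptions for illustration; the source does not say how modification is detected.

```python
def child_step(state, event, version):
    """One iteration of the child loop.

    state holds "update" (bool) and "last_sent" (version last sent).
    event is ("SEND", filename), ("SEND_CURRENT",) or ("TIMEOUT",).
    version identifies the current state of the ARC/INFO selection.
    Returns the message to send to the XGobi client, or None."""
    kind = event[0]
    if kind == "SEND":                      # SEND Filename
        state["update"] = False
        return ("file", event[1])
    if kind == "SEND_CURRENT":              # SEND current
        state["update"] = True
        state["last_sent"] = version
        return ("selection", version)
    if kind == "TIMEOUT" and state["update"] and version != state["last_sent"]:
        state["last_sent"] = version
        return ("update", version)
    return None

state = {"update": False, "last_sent": None}
print(child_step(state, ("SEND_CURRENT",), 1))  # ('selection', 1)
print(child_step(state, ("TIMEOUT",), 1))       # None
print(child_step(state, ("TIMEOUT",), 2))       # ('update', 2)
```

A timeout therefore produces traffic only when the client has asked for the current selection and that selection has changed since the last send.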

3.3  The  XGobi  Part 

There are several methods that we considered when
initially contemplating a link from ARC/INFO to
XGobi: directly writing new functionality into XGobi,
accessing the XGobi data structures by calling XGobi as
a subroutine, or using the linked brushing protocols existing
in XGobi. The first method is feasible because the
code for XGobi is available, but it is undesirable because
it would require maintaining updates with new releases
of the XGobi code. The third option strictly limits the
interaction to the data structures available in the XGobi
linked brushing code. Calling XGobi as a subroutine
from a small control panel was chosen as the method
that best suited our needs. Almost all the data structures
used in XGobi are available for modification using
this approach.

DEC is a trademark of Digital Equipment Corporation.
Sun is a trademark of Sun Microsystems, Inc.
Sparc is a trademark of Sun Microsystems, Inc.

The structure of the calling program is based on
the subroutine template code provided with the XGobi
source code. (The subroutine approach also has been
used by Littman et al., 1992, for the implementation
of the XGvis software system.) A control panel is initiated
for each instance of XGobi (see Figure 1), from
which the user has the option of selecting data from an
ARC/INFO data base file or receiving the data set that
is currently selected within ARC/INFO. Once the data
source has been determined and the data received, the
XGobi window is initialized.

Internally, an additional working procedure, namely a
routine that runs once whenever the X Window System
event loop finds no events, has been added to XGobi to
check for incoming data from the ARC/XGobi server. If
this routine receives updates of the data, the attribute
values currently visible in XGobi linked to the brushed
or subsetted coordinates in ARC/INFO will instantaneously
be set to the same glyph, size, and color.

Otherwise,  the  entire  functionality  of  XGobi  has  been 
maintained.  The  XGobi  part  can  be  described  via  the 
following  pseudo  code. 

init sockets
connect to ARC/XGobi server
init XGobi defaults
init startup window
repeat
    wait for user input
    send input to ARC/XGobi server
    wait for data from ARC/XGobi server
    if (XGobi not invoked)
        then invoke XGobi
        else update XGobi structures and data sets
until (user input = QUIT)

3.4  Usage 

ARC/INFO  and  the  ARC/XGobi  (parent)  interface 
process  must  be  activated  on  the  host  where  the 
ARC/INFO  data  base  is  located.  Then,  XGobi  client 
processes can connect to the ARC/XGobi server. Clients
can  reside  on  the  same  host  or  anywhere  else  on  the  In¬ 
ternet.  Internet  addresses,  ports,  and  communication 
protocols  are  encoded  into  the  program.  So,  the  user 
does  not  have  to  worry  about  common  setups  for  server 
and  clients.  If  the  user  only  wants  to  use  XGobi  to  an¬ 
alyze  the  attribute  data  in  an  ARC/INFO  data  set,  the 
invocation  of  ARC/INFO  is  not  required. 
4 An Example

As  an  example  of  how  the  link  between  ARC/INFO  and 
XGobi  can  be  used  to  explore  data  we  show  a  data  set 
containing  water-quality  data  collected  during  several 
weeks  at  seventeen  surface-water  sampling  sites  in  Erath 
County,  Texas  (see  Figure  1).  The  pollutants  are  being 
modelled  inter  alia  through  explanatory  variables,  such 
as  the  number  of  dairies  per  acre  or  the  number  of  head 
of  cattle  per  acre,  to  account  for  large-scale  variability. 

As  well  as  the  sampling  sites  (numbered  from  1  to  24 
with  some  numbers  missing),  the  ARC/INFO  mapview 
shows  streams  (continuous  lines),  boundaries  of  large 
basins  (dashed  lines),  dairies  (triangles),  and  a  town 
(shaded  area).  After  an  initial  examination  of  the  map, 
two  of  the  sampling  sites,  numbered  4  and  12,  have  been 
brushed  in  order  to  see  if  the  data  collected  there  is 
anomalous. 

Site  4  has  been  brushed  because  it  is  at  the  outlet 
of a very small basin containing four dairies. Thus,
the response variables (i.e., pollutants) might be expected
to be unusually high. The XGobi view on the
left  shows  that  the  explanatory  variable  “nda”  (the  num¬ 
ber  of  dairies  per  acre)  is  extremely  large  relative  to  the 
other  sites.  The  XGobi  view  on  the  right  shows  that 
the  responses  “no3”  (standardized  nitrate)  and  “nh3” 
(standardized  ammonia)  at  this  site  are  high,  though 
not  outlying. 

Site  12  has  been  brushed  because  it  is  located  just 
below  the  town  of  Stephenville,  Texas;  the  waste  wa¬ 
ter  treatment  plant  of  Stephenville  discharges  into  the 
stream  above  the  sampling  site.  The  XGobi  view  on  the 
right  shows  that  standardized  nitrate  is  consistently  high 
at  this  site,  but  nothing  remarkable  can  be  said  about 
standardized  ammonia.  Based  on  this  result,  an  analyst 
doing  an  exploratory  data  analysis  probably  would  be 
interested  in  how  strong  the  nitrate  concentration  is  fur¬ 
ther  downstream  of  Stephenville  and,  therefore,  the  next 
site  to  be  brushed  in  the  ARC/INFO  view  might  be  site 
24. 

This  example  demonstrates  briefly  how  geography  can 
give  an  analyst  insight  into  the  exploration  of  a  data  set 
and,  thus,  why  a  link  between  ARC/INFO  and  XGobi  is 
a  useful  tool. 

5  Future  Work 

We have presented an interface linking ARC/INFO with
XGobi.  This  interface  allows  the  user  to  link  ARC/INFO 
and  XGobi  views  interactively  such  that  modifications 
of  the  ARC/INFO  view  automatically  change  the  other 
views  in  the  different  XGobi  clients.  So  far,  this  link 
is only unidirectional. Future work will focus on the
other direction of the link, that is, the update of the
ARC/INFO view according to one (or several) XGobi
view(s). However, this direction is more problematic
since substantial questions, such as security (Who
is allowed to modify the data?), concurrency (What
if two XGobi clients send different update information
at the same time?), and technical issues (How
can events be incorporated into a primarily non-event-driven
FORTRAN program?) have to be resolved. We
are also looking towards a new release of ArcView™.
According  to  preliminary  announcements  from  the  dis¬ 
tributors  of  this  software,  this  new  version  seems  to  be 
more  suitable  to  facilitate  the  inverse  link  from  XGobi 
to  ARC/INFO. 
Abstract

This  paper  introduces  the  task  of  converting  summary  tables 
into  row-labeled  plots.  The  conversion  task  emphasizes 
exposition  of  important  patterns  in  data  rather  than  data 
archival.  Attention  to  graphical  design  details  yields  plots 
that  appear  simple  even  though  the  tables  are  fairly 
complex.  The  more  general  task  includes  converting 
multiway tables and distributional summaries to plots. This
brief  paper  focuses  attention  on  two  templates  for  expressing 
two-way  tables  as  plots.  These  templates  are  variations  on 
familiar  dot  and  bar  plots  and  have  numerous  applications. 

1.  Introduction 

This  paper  advocates  the  use  of  row-labeled  plots  for 
graphical  presentation  of  tabular  information.  Row-labeled 
plots  (or  row  plots  for  short)  take  three  basic  forms,  dot 
plots  (charts),  horizontal  bar  plots  (charts),  and  horizontal 
distributional summary plots, such as boxplots. While the
plots  are  familiar,  government  reports  still  seem  to  favor 
tables  over  plots.  Numerous  reasons  can  be  cited  for  the 
common  usage  of  tables:  historical  inertia,  an  emphasis  on 
data  archival  rather  than  on  communication,  limited  access 
to  software  that  produces  presentation  quality  graphics 
(especially  for  dot  plots),  and  an  absence  of  graphical 
paradigms  for  handling  the  challenges  posed  by  reexpressing 
tabular  information.  As  part  of  the  advocacy  for  row  plots, 
this  paper  provides  software  and  paradigms  that  address 
several  of  these  challenges. 

The  basic  tasks  in  converting  tables  to  plots  involve 
accommodating  the  tabular  structure  and  emphasizing  chosen 
comparisons.  Structure  related  challenges  include 
representing  several  factors,  handling  nested  factors,  showing 
many  levels  within  a  factor,  providing  resolution  for  a  large 
range  of  values,  and  showing  distributional  summaries. 
Emphasis  considerations  include  stressing  estimates  over 
confidence  bounds  and  calling  attention  to  the  more  accurate 
estimates.  Carr  (1994)  provides  examples  for  all  of  these 
cases  including  redesigned  boxplots.  This  paper  presents 
two  templates  for  converting  two  factor  tables  to  plots. 


2. Two-Factor Row Plots

In  row  plots  the  levels  of  one  factor  become  rows.  Row 
plots  accommodate  a  second  factor  in  one  of  three  ways:  by 
using  symbols,  by  using  multiple  panels,  or  by  showing 
the levels of both factors as rows. Figure 1 provides an
example  using  symbols.  Rows  represent  the  16  levels  of 
the  carcinogens  factor.  Symbols  represent  the  two  levels  of 
the  years  factor.  Representing  the  two  levels  of  the  second 
factor  using  symbols  is  advantageous.  All  the  values  can  be 
compared  using  a  single  common  scale.  No  space  is  lost 
through  adding  panels  or  rows. 

The  symbols  used  in  Figure  1  emphasize  change.  Open 
circles  show  the  1987  values  and  the  arrow  tips  designate  the 
1988  values.  A  horizontal  line  from  the  1987  value  to  the 
1988  value  explicitly  shows  the  change.  When  the  change 
is  small,  a  circle  with  a  dot  represents  both  the  1987  and  the 
1988  values.  This  reflects  a  willingness  to  make  small 
adjustments  in  symbol  placement  (the  1988  value)  and  style 
for  graphical  simplicity  and  clarity. 

Several  additional  facets  of  Figure  1  reflect  design 
considerations:  grid  lines,  sorting  and  grouping  of  rows,  and 
the  log  scale.  The  horizontal  grid  lines  in  Figure  1  are 
white  lines  on  a  light  gray  background.  The  gray 
background  gives  the  plot  a  value-added  appearance.  The 
small  contrast  between  the  light  gray  and  white  lines  allows 
the  lines  to  be  perceived  as  part  of  the  background  rather 
than  competing  with  the  symbols  in  the  foreground.  The 
horizontal  lines  help  in  table  lookup  (matching  of  labels  and 
symbols). 

Two design considerations, sorting and grouping of rows,
help the plot appear less complex. The sorting of rows by
the  1987  values  reduces  the  visual  distance  between  the 
symbols  as  the  reader  scans  the  plot  vertically.  The 
conjecture  here  is  that  reducing  the  visual  distance  in 
looking  from  point  to  point  makes  the  plot  appear  less 
complex. 
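
The sorting and grouping steps can be sketched in a few lines. This is only an illustration under stated assumptions: the labels and values are made up (they are not the data behind Figure 1), the function name `sort_and_group` is hypothetical, and largest-first ordering is assumed.

```python
def sort_and_group(rows, group_size=4):
    """Sort (label, value) rows by descending value and split into groups."""
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    return [ordered[i:i + group_size]
            for i in range(0, len(ordered), group_size)]


# Made-up values for illustration; these are not the data behind Figure 1.
rows = [("A", 3.0), ("B", 40.0), ("C", 0.5), ("D", 12.0),
        ("E", 7.0), ("F", 1.5), ("G", 25.0), ("H", 2.0)]

groups = sort_and_group(rows)
for group in groups:
    for label, value in group:
        print(f"{label:<4}{value:>8.1f}")
    print()  # blank line separates the groups
```

Sorting first means that neighboring rows hold similar values, and the blank line between groups plays the role of the visual gap between row groups in the plot.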


1  This  work  was  supported  by  EPA  under  cooperative  agreement  No.  CR8280820-01-0.  The  article  has  not  been  subject  to 
peer  review  of  the  EPA  and  thus  does  not  necessarily  reflect  the  view  of  the  agency  and  no  official  endorsement  should  be 
inferred. 
[Figure 1 graphic not reproduced. Title: "Air Emission of Carcinogens"; subtitle: "Top Chemicals By Weight". Row labels (column heading "Chemical"), top to bottom: Dichloromethane, Tetrachloroethylene, Benzene, Styrene, Chloroform, Formaldehyde, 1,3-Butadiene, Acrylonitrile, 1,2-Dichloroethane, Ethylene Oxide, Carbon Tetrachloride, Propylene Oxide, Acetonitrile, Vinyl Chloride, 1,4-Dichlorobenzene, Lead.]

Figure 1. A row plot with symbols representing the levels of a second factor.


The grouping of rows in Figure 1 creates smaller perception
units. Four groups of four appear more manageable than
one group of sixteen. Grouping rows also facilitates the
matching of symbols to row labels. While the horizontal
grid lines help, grid lines are not so crucial with grouping
because matching, say, the third of four labels with the third
of four symbols is trivial. The grouping of rows diminishes
the advantage of using right-aligned row labels. Cleveland's
examples (1984, 1985, 1993a, and 1993b) show the
evolution from left-aligned labels to right-aligned labels.
Right-aligned labels are closer to the horizontal grid lines
and corresponding symbols, so right-alignment should
reduce the chances of making an error in matching labels
with symbols. However, the conjecture here is that
grouping makes the error rate very low so that there is little
advantage in using right-aligned labels. Here the preference
is to follow the conventions for the dominant activity in
each part of the plot. Reading is the dominant activity in
the row-label part of the plot so the labels are left-aligned.

Figure 1 uses a log scale. The carcinogens selected for the
table motivating Figure 1 represent the most extreme cases
in terms of pounds. The range of values for extreme cases is
often large and using a log scale helps to provide resolution
for the smaller values. The difference of values on a log
scale is a monotonic function of the percentage change. The
log scale is fine for mathematically sophisticated audiences.
For more general audiences, a plot showing percentage
change on a linear scale would be helpful. Of course,
sorting rows by percentage change and the other design
considerations still apply.

[Figure 2 graphic not reproduced. Title: "TRI Releases And Transfers For 1987"; subtitle: "Totals By State and Distribution Class, Grand Total = 7 Billion Pounds". Row labels (column heading "STATE"), top to bottom: Texas, Louisiana, Ohio, Florida, Tennessee, Michigan, Illinois, Indiana, Utah, Pennsylvania, California, Virginia, New York, Missouri, New Jersey, Mississippi, Georgia, North Carolina, Alabama, Kansas, Kentucky, Wisconsin, South Carolina, Arkansas, Arizona, Massachusetts, West Virginia, Minnesota, Connecticut, Oklahoma, Iowa, Maryland, Washington, Alaska, Oregon, Montana, Wyoming, New Mexico, Colorado, Maine, Nebraska, New Hampshire, Idaho, Delaware, Rhode Island, Hawaii, South Dakota, Vermont, Nevada, North Dakota.]

Figure 2. A row plot with panels representing the levels of a second factor.
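
The earlier remark that a difference on a log scale is a monotonic function of the percentage change can be checked numerically. The sketch below is illustrative only; the function names and the numbers are made up and are not taken from the carcinogen table.

```python
import math


def log10_difference(old, new):
    """Horizontal distance between two values on a log10 axis."""
    return math.log10(new) - math.log10(old)


def percent_change(old, new):
    """Percentage change from old to new."""
    return 100.0 * (new - old) / old


# The same +50% change gives the same log distance at any magnitude.
small = log10_difference(10.0, 15.0)
large = log10_difference(1_000_000.0, 1_500_000.0)

# log10(new) - log10(old) = log10(1 + pct/100), which increases with pct.
pct = percent_change(10.0, 15.0)
from_pct = math.log10(1.0 + pct / 100.0)
```

Because the distance depends only on the ratio of the two values, equal arrow lengths in a plot like Figure 1 correspond to equal percentage changes, whatever the size of the emission.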

Sorting  and  grouping  considerations  carry  over  to  multiple 
panel  displays  as  in  Figure  2.  Figure  2  uses  rows  to 
distinguish  the  values  for  the  50  states.  In  this  case  the 
design  groups  rows  in  units  of  five.  Grouping  rows  into 
units  of  four  likely  has  some  cognitive  advantages  over 
units  of  five,  but  producing  ten  full  groups  of  five  seems 
reasonable.  The  toxic  release  class  factor  has  6  levels  plus  a 
marginal  summary.  Representing  seven  levels  using 
symbols  is  a  bad  idea.  In  fact  Kosslyn  (1994)  suggests  that 
distinguishing  among  more  than  four  elements  gets 
complicated.  Consequently  Figure  2  represents  the  levels  of 
the  second  factor  using  multiple  panels  rather  than  symbols. 

Figure  2  reflects  several  additional  design  considerations. 
The  plotted  symbols  are  bars.  Bars  are  visually  dominant 
area  symbols  that  allow  the  reader  to  quickly  scan  the  whole 
plot  even  though  there  are  separating  panel  lines.  Note  that 
the  bars  in  the  right-most  six  panels  are  on  the  same  scale 
so  are  directly  comparable. 

The  common  approach  to  creating  comparable  bars  is  to  use 
identical  width  panels  with  identical  scales  that  cover  the  full 
range  of  data.  The  result  of  this  approach  is  that  panels  with 
small values are largely blank. The right-most six panels in
Figure 2 have different widths. Since the "underground"
values are the largest, the "underground" panel is widest.
This  unequal  panel  width  approach  makes  effective  use  of 
the  available  space  while  preserving  comparability.  Given  a 
fixed  plotting  space,  the  range-driven  panel  width  approach 
uses  the  otherwise  blank  space  to  increase  the  resolution 
within  all  panels. 
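
One way to read the range-driven panel width idea is to allocate width in proportion to each panel's largest value. The sketch below assumes exactly that rule; the function name, the panel names other than "underground", and all numbers are hypothetical and are not the TRI values or the layout code used for Figure 2.

```python
def range_driven_widths(panel_max, total_width):
    """Split total_width among panels in proportion to each panel's maximum.

    Every panel then has the same data units per unit of width, so bars
    stay directly comparable across panels while little space is blank.
    """
    total = sum(panel_max.values())
    return {name: total_width * vmax / total
            for name, vmax in panel_max.items()}


# Hypothetical panel maxima (arbitrary units), not the TRI values.
panel_max = {"air": 30.0, "water": 10.0, "underground": 60.0}
widths = range_driven_widths(panel_max, total_width=5.0)

# Data units per unit of width is identical in every panel.
scales = {name: panel_max[name] / widths[name] for name in panel_max}
```

With equal-width panels the common scale would have to cover the largest maximum in every panel; here the space that would have been blank is given back to all panels as extra resolution.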

Figure  2  includes  vertical  grid  lines.  Cleveland  (1993a, 
1993b)  provides  a  demonstration  that  shows  how  helpful 
grid  lines  are  in  making  more  accurate  comparisons  across 
panels.  Cleveland  notes  that  the  grid  lines  allow  attention 
to  be  focused  on  smaller  graphical  elements  and  that  Weber's 
law  helps  to  explain  the  increased  accuracy  of  comparison. 
Grid  lines  are  an  important  facet  of  the  graphical  design. 

Figure 2 uses a different scale for the state totals panel to save
space for the other panels. The figure calls out this different
scale in four ways: by using black bars rather than dark gray
bars as in the other panels, by the slight separation from the
other panels, by warning text below the panel, and by the tick
labels. In addition, the grid spacing turns out to be different.
The design places the state total panel first among the
panels  because  it  is  an  executive  summary  and  the  basis  for 
the  sorting  of  rows. 

Figure  2  provides  a  quick  state-based  overview  of  the  toxic 
releases  for  the  different  release  classes.  The  table 
motivating this plot appears in Courteau (1990). The table
spans pages 37 and 38 and summarizes the company
self-reports  with  nine  digits  of  accuracy.  The  table  is  truly  a 
visually  intimidating  table.  Table  design  considerations 
such  as  rounding  numbers,  sorting  rows  and  grouping  rows 
can  substantially  improve  the  table  for  exposition  purposes. 
However  most  readers  will  still  prefer  the  graphical 
summaries  like  Figure  2. 

A  single  plot  will  not  necessarily  cover  all  the  major 
exposition  objectives  for  a  table.  Most  people  are  interested 
in the values for their state. People from states like Hawaii
won't see much in Figure 2. More resolution is desirable.
An  additional  plot  showing  the  six  release  class  values  as 
percentage  of  state  totals  would  be  helpful  as  a  summary. 
Individual  state  maps  can  show  further  detail.  The  tabular 
summaries  of  the  Toxic  Release  Inventory  can  lead  to  many 
visual  representations. 

3.  Comments  and  Conclusions 

Historically,  tables  in  government  documents  served  a  data 
archival  role.  Today  electronic  storage  better  serves  this 
archival  role.  Government  publications  need  to  change  from 
an  archival  orientation  to  a  data  exposition  orientation. 
While  some  tables  may  remain  because  tables  can  be 
advantageous  for  careful  quantitative  analysis,  most  people 
prefer  the  more  qualitative  visual  understanding  provided  by 
plots.  The  development  of  graphics  templates  and 
corresponding  software  will  facilitate  making  this  change. 

The government community primarily has access to the
spreadsheet-based business graphics developed in the 1970s.
Much  has  been  learned  about  graphic  design  in  the  last  two 
decades.  Those  that  study  graphic  design  know  that  stacked 
bar  plots  and  pie  charts  are  inferior  visual  representations  of 
data.  Nonetheless  such  graphics  commonly  appear  in 
government publications because the available software
makes the graphics easy to produce.

The  two  figures  in  this  paper  provide  templates  for 
converting commonly encountered two-factor tables into
plots.  More  cases  are  covered  in  Carr  (1994).  The  examples 
have  face  validity.  If  they  appear  better  than  numerous 
alternatives  then  likely  they  are  better.  However  the 
examples  have  not  been  subject  to  rigorous  cognitive  tests 
and  the  plot  that  cannot  be  improved  is  exceedingly  rare. 
Readers  are  welcome  to  develop  their  own  templates  for 
converting tables to plots. Those wanting to modify or use
the current templates can build upon the current work. The
data, Splus functions and script files are in
/pub/submissions/rowplot  on  galaxy.gmu.edu  and  can  be 
obtained  by  anonymous  ftp. 
Abstract

Probabilistic  graphical  models  (directed  and  undirected 
Markov  fields,  and  combined  in  chain  graphs)  are  used 
widely  in  expert  systems,  image  processing  and  other  ar¬ 
eas  as  a  framework  for  representing  and  reasoning  with 
probabilities.  They  come  with  corresponding  algorithms 
for  performing  probabilistic  inference.  This  paper  dis¬ 
cusses  an  extension  to  these  models  by  Spiegelhalter  and 
Gilks,  plates,  used  to  graphically  model  the  notion  of  a 
sample.  This  offers  a  graphical  specification  language  for 
representing  data  analysis  problems.  When  combined 
with  general  methods  for  statistical  inference,  this  also 
offers  a  unifying  framework  for  prototyping  and/or  gen¬ 
erating  data  analysis  algorithms  from  graphical  specifi¬ 
cations.  This  paper  outlines  the  framework  and  then 
presents some basic tools for the task: a graphical version
of the Pitman-Koopman Theorem for the exponential
family, problem decomposition, and the calculation
of  exact  Bayes  factors.  Other  tools  already  developed, 
such  as  automatic  differentiation,  Gibbs  sampling,  and 
use  of  the  EM  algorithm,  make  this  a  broad  basis  for  the 
generation  of  data  analysis  software. 


Introduction

This paper argues that the data analysis tasks of learning
and knowledge discovery can be handled using graphical
models [11]. This meta-level use of graphical models was
first suggested by Spiegelhalter and Lauritzen in the context
of learning probabilities for Bayesian networks. An
extension of the standard graphical model is used here
that allows this kind of learning to be represented. The
extension is the notion of a plate introduced by Spiegelhalter
and Gilks. Plates allow samples to
be represented explicitly on the graphical model, and
thus reasoned about. This makes data analysis problems
explicit in much the same way that utility and decision
nodes are used for decision analysis problems.

Consider, for instance, Figure 1. This presents a situation
where a mixture model with hidden variable class
is used for subsequent prediction of $\mathit{var}_1$ from $\mathit{var}_2$ and
$\mathit{var}_3$. The part to the left of the parameters $\theta$ and $\phi$ is
the graphical representation of a sample, and the part to
the right represents the prediction task. The value node,
the diamond, indicates that subsequent prediction accuracy
is the goal of learning, while the contents of the plate
(the large box around the nodes for class, $\mathit{var}_1$, $\mathit{var}_2$ and
$\mathit{var}_3$) indicate that a sample of $N$ values of $\mathit{var}_1$, $\mathit{var}_2$
and $\mathit{var}_3$ is given, because they are shaded, while class
is hidden, being unshaded. The plate indicates that its
contents are replicated $N$ times, yielding a product $\prod_{i=1}^{N}$
in the probability form. A legend for graphical models
used in this paper appears in Figure 2.

Figure 1: Simple unsupervised learning, with general
prediction


[Figure 2 graphic not reproduced; the legend entries are:]

• Node for unknown variable (unshaded means unknown).
• Node for known variable $\mathit{var}_1$ (shaded means known).
• Box is an action node. We can set this value ourselves.
• Diamond is a value/utility node. We would like to achieve a high
  value for this variable.
• Plate around graph component implies it is repeated $N$ times,
  so we have $\prod_{i=1}^{N} p(\mathit{var}_i \mid \theta)$.
• Double node implies variable has deterministic equation:
  $\mathit{var} = f(\mathit{class}, \theta)$.
• Set of arcs into node implies probability has component:
  $p(\mathit{var}_1 \mid \phi, \theta_1)$.
• Undirected arc implies variables are correlated.

Figure 2: A legend for graphical symbols


A  general  approach  to  the  design  of  learning  and  data 
analysis  algorithms  now  becoming  widespread  is  one  of 
engineering  using  principles  of  probability.  An  example 
is  given  in  [3]  where  decision  tree  algorithms,  made  pop¬ 
ular  by  the  CART,  ID3  and  C4.5  programs,  are  devel¬ 
oped  from  basic  probability  principles.  The  basic  tools 
of  probabilistic  (Bayesian)  inference  used  for  this  type 
of process are reviewed, for instance, by Tanner [10] and
Kass  and  Raftery  [9]:  various  exact  methods,  Markov 
chain  Monte  Carlo  methods  such  as  Gibbs  sampling,  the 
EM  algorithm,  and  the  Laplace  approximation.  With 
creative  combination,  these  are  able  to  address  a  wide 
range  of  data  analysis  problems.  Gilks,  Spiegelhalter 
and  Thomas  have  taken  this  process  a  step  further  by  de¬ 
veloping  a  compiler  that  generates  Gibbs  samplers  from 
graphical  specifications  [8].  This  handles  a  surprisingly 
broad  number  of  statistical  tasks. 

It  is  the  thesis  of  this  paper  that  these  techniques  are 
now  sufficiently  well  developed  so  that  software  support 
can  be  provided  for  their  use  in  data  analysis  problems. 
That  is,  we  are  now  able  to  generate  components  of  data 
analysis  algorithms,  and  even  entire  algorithms  them¬ 
selves  from  high-level  specifications.  More  details  of  this 
general capability can be found in [2]. A software generator
needs two parts to make it work.

Language  to  specify  problems: 
probabilistic  graphical  models  (chain  graphs  [11])  ex¬ 
tended  with  plates  are  used  as  a  specification  language. 
When  augmented  with  specific  functional  forms  such 
as the Gaussian and the logistic, this language is sufficiently
powerful to represent a broad range of problems
across  several  fields:  generalized  linear  models,  feed¬ 
forward  networks,  Jordan  and  Jacobs  mixture  of  ex¬ 
perts,  unsupervised  learning  of  many  different  kinds, 
and  hybrids  of  these  models.  A  simple  connectionist 
feed-forward  network  and  its  corresponding  Bayesian 
network  is  given  in  Figure  3(a)  and  (b)  respectively. 
The Bayesian network represents the feed-forward network
using deterministic nodes and then tacks on an
error model at the end of the network to indicate that
the measured response variables are not deterministic
functions of the inputs. The feed-forward network
in this configuration therefore computes means of a
Gaussian.

Figure 3: A simple feed-forward network: (a) in native
form (b) as a DAG

Algorithm  schemas:  these  are  templates  for  high- 
level  algorithms  prior  to  code  generation  and  compi¬ 
lation. 

• Gilks et al. [8] have developed general algorithms to
perform  Gibbs  sampling  on  Bayesian  networks  with 
plates. 


•  Other  algorithms  such  as  conjugate  gradient, 
Fisher’s  scoring  method,  or  Laplace  approximations 
[9]  can  be  applied  once  first  and  second  derivatives 
are  calculated  for  model  parameters. 

•  The  automatic  calculation  of  derivatives  on  struc¬ 
tures  is  a  well  understood  problem.  In  neural  net¬ 
works,  this  corresponds  to  the  Back-propagation  al¬ 
gorithm  and  its  extensions  for  second  derivatives. 
Likewise,  the  calculation  of  derivatives  on  proba¬ 
bilistic  graphical  models  is  an  application  of  the 
chain  rule  for  differentiation.  Details  appear  in  [2]. 

•  The  more  general  application  of  the  EM  algorithm 
for  hidden  variables  is  obvious. 

Component libraries: Almond et al. [1] point out
that parts of a graph, components, are often shared
in  a  series  of  applications.  Learning  and  data  analysis 
are  no  different.  One  useful  component  is  the  gener¬ 
alized  linear  model  which  can  include  basis  function 
sets  for  orthogonal  polynomials  or  wavelets. 

We  can  see  that  many  parts  of  this  ambitious  plan,  a 
software  tool  kit  for  data  analysis,  are  already  in  place. 
The  plan  needs  to  be  qualified,  however.  Proponents 
of  Gibbs  sampling,  for  instance,  say  that  the  design  of 
an  efficient  sampler  takes  care  and  experience.  Specific 
matrix  forms  might  be  used  to  advantage.  It  is  often  the 
case  that  some  fine  tuning  is  needed  in  algorithms.  The 
aim  here  is  to  provide  tools  for  software  engineering,  not 
complete  packaged  solutions. 

One  task  that  can  never  have  direct  software  support  is 
the  design  of  an  appropriate  model  with  an  appropriate 
prior.  This  is  a  knowledge  elicitation  problem.  Tech¬ 
niques  here  are  varied  and  range  from  careful  choice  of 
the  representation  to  simplify  elicitation,  to  techniques 
for  working  with  components  and  libraries  [1].  But  the 
elicitation  task  still  has  to  be  done  afresh  with  each  dif¬ 
ferent  problem,  except  in  those  prototypical  situations 
that  are  routinely  addressed  by  standard  statistical  pack¬ 
ages.  While  one  might  use  a  standard  package  in  initial 
modeling,  as  the  problem  becomes  better  understood, 
specific  requirements  are  needed  that  canned  software 
may  not  provide.  Of  course,  tools  for  software  gener¬ 
ation  alleviate  the  modeling  task  greatly  by  providing 
rapid  prototyping.  Nevertheless,  it  is  my  view  that  a  siz¬ 
able  burden  in  the  Bayesian  analysis  of  data  is  software 
engineering  rather  than  the  statistical  analysis  itself,  and 
therefore  software  generators  and  support  tools  are  both 
a  realistic  and  important  goal. 

In  this  paper  we  discuss  a  few  more  pieces  for  this 
general  software  toolkit.  The  first  is  an  algorithm  for 
the  decomposition  of  a  chain  graph  with  plates  into  in¬ 
dependent  components.  This  technique  has  been  used 
to develop efficient algorithms for learning Bayesian networks
from complete data [4]. The second contribution
is  some  exact  algorithms  on  graphical  models  with  a  sin¬ 
gle  plate.  Both  these  simplify  calculation  of  the  Bayes 
factor  for  a  model,  used  widely  in  Bayesian  methods  [9]. 
The Bayes factor is the support given to model $M_2$ relative
to model $M_1$ by the data sample,

$$\text{Bayes-factor}(M_2, M_1) = \frac{p(\mathit{sample} \mid M_2)}{p(\mathit{sample} \mid M_1)} .$$

We use the term evidence for the basic component,

$$\mathit{evidence}(M) = p(\mathit{sample} \mid M) ,$$

and consider its calculation throughout.
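
A small worked case may make the evidence and the Bayes factor concrete. The sketch below is an illustration, not a method taken from this paper: it assumes a binary sample, a model $M_1$ with a fixed success probability of 0.5, and a model $M_2$ with an unknown probability under a conjugate Beta(1, 1) prior. The data and function names are made up.

```python
import math


def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_evidence_fixed(sample, p=0.5):
    """log p(sample | M1): Bernoulli with a known success probability."""
    ones = sum(sample)
    zeros = len(sample) - ones
    return ones * math.log(p) + zeros * math.log(1.0 - p)


def log_evidence_beta(sample, a=1.0, b=1.0):
    """log p(sample | M2): Bernoulli with a conjugate Beta(a, b) prior.

    Integrating the parameter out gives B(a + ones, b + zeros) / B(a, b).
    """
    ones = sum(sample)
    zeros = len(sample) - ones
    return log_beta(a + ones, b + zeros) - log_beta(a, b)


sample = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]  # made-up binary data
log_bf = log_evidence_beta(sample) - log_evidence_fixed(sample)
bayes_factor = math.exp(log_bf)  # > 1 favours M2 over M1
```

For this sample the evidence for $M_2$ is $1/495$ and for $M_1$ is $1/1024$, so the Bayes factor is $1024/495$, a little over 2. Working in logs avoids underflow for larger samples.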

While  these  techniques  can  be  used  in  many  places  in  a 
learning  toolkit,  one  interesting  by-product  is  that  they 
show  how  to  develop  algorithms  for  learning  DAGs  from 
complete  data  where  the  conditional  distributions  are  in 
the  exponential  family,  including  mixtures  of  Gaussians, 
Poissons,  discrete  variables,  etc.  All  that  is  required  is 
a  conjugate  prior.  While  this  capability  should  not  be 
surprising  — and  perhaps  the  hardest  part,  appropriate 
priors,  is  left  out — it  is  interesting  that  we  can  construct 
these  algorithms  automatically  using  the  operations  pre¬ 
sented  here.  More  recent  work  has  focused  on  the  devel¬ 
opment  of  priors  and  their  use  in  the  broader  scheme  of 
things. 


Exact  algorithms  on  graphs  with  plates 

The  removal  of  a  plate  from  a  graphical  model  requires 
conditions  that  are  well  known  in  statistics.  The  problem 
reduces  to  the  existence  of  sufficient  statistics  giving  a 
graphical  version  of  the  Pitman-Koopman  Theorem  from 
statistics. 

Figure 4: The generalized graph for plate removal

Comment 1 (Plate removal). Consider the model $M$
represented by the graphical model for a sample of size
$N$ given in Figure 4(a), where $x$ is in the domain $X$
and $y$ is in the domain $Y$, both independent of $\theta$, and
both domains have components that are real valued or finite
discrete. Let the conditional distribution for $x$ given
$y, \theta$ be $f(x \mid y, \theta)$, which is positive for all $x \in X$. If first
derivatives exist w.r.t. all real valued components of $x$
and $y$, the plate removal operation applies for all samples
$x_* = x_1, \ldots, x_N$, $y_* = y_1, \ldots, y_N$ and $\theta$, as given in
Figure 4(b), for some sufficient statistics $T(x_*, y_*)$ of dimension
independent of $N$ if and only if the conditional
distribution for $x$ given $y, \theta$ is in the exponential family¹

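with form

$$p(x \mid y, \theta, M) = \frac{h(x, y)}{Z(\theta)} \exp\left( \sum_{i=1}^{k} w_i(\theta)\, t_i(x, y) \right) , \quad (1)$$

for some functions $w_i$, $t_i$, $h$ and $Z$ and some integer $k$.
In this case, $T(x_*, y_*)$ is an invertible function of the $k$
averages

$$\frac{1}{N} \sum_{j=1}^{N} t_i(x_j, y_j) , \quad i = 1, \ldots, k .$$

¹ We have yet to find a clear development of this. Usually,
$y$ isn't included in the classic treatment, but we need it here
and it works.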


Graph  decomposition 

Learning problems can be decomposed into sub-problems in some cases. For instance, consider the learning problem given in Figure 5 over two multinomial variables $var_1$ and $var_2$, and two Gaussian variables $x_1$ and $x_2$. For this problem we have specified two alternative models, model $M_1$ and model $M_2$. Model $M_2$ has an additional arc going from the discrete variable $var_2$ to the real valued variable $x_1$. We will use this subsequently to discuss local search of these models evaluated by their Bayes factor.

[Figure 5 panels: Model = $M_1$; Model = $M_2$]

A straightforward manipulation of the conditional distribution for this model yields, for model $M_1$, the conditional distribution given in Figure 6. When parameters $\theta_1$, $\theta_2$, etc., are a priori independent, and their data likelihoods do not introduce cross terms between them, the parameters become a posteriori independent as well. This occurs for $\theta_1$, $\theta_2$, and the set $\{\mu_1, \sigma_1\}$. This model simplification also implies the evidence for model $M_1$ decomposes similarly. Denote the sample of the variable $x_1$ as $x_{1,*} = x_{1,1}, \ldots, x_{1,N}$, and likewise for $var_1$ and $var_2$, etc. In this case, we get,

$$evidence(M_1) = p(var_{1,*} \mid M_1)\; p(var_{2,*} \mid var_{1,*}, M_1)\; p(x_{1,*} \mid var_{1,*}, M_1)\; p(x_{2,*} \mid \ldots, var_{1,*}, M_1) . \quad (2)$$

Figure 6: A simplification of model $M_1$

The evidence for model $M_2$ is similar except that the posterior distribution of $\mu_1$ and $\sigma_1$ is replaced by the posterior distribution for $\mu'_1$ and $\sigma'_1$.

This result is general, and applies to DAGs, undirected graphs, and more generally to chain graphs. Similar results are covered by Dawid and Lauritzen [7] for a family of models they call hyper-Markov. The general result described above is an application of the rules of independence applied to plates. This uses a notion of local dependence, which is called the Markov blanket. The Markov blanket of a node consists of its parents, its children, and its children's parents. If deterministic nodes are involved, the definition requires a bit more care [2].

To perform the simplification depicted in Figure 6, it is sufficient then to find the finest partitioning of the model parameters such that they are independent. The decomposition in Figure 6 represents the finest such partition of model $M_1$. The evidence for the model will then factor according to the partition, as given for model $M_1$ in Equation (2). For this task we have the following theorem.

Theorem 1 (Decomposition). A model $M$ is represented by a chain graph $G$ with plates and no deterministic nodes. Let the variables in the graph be $X$. We have $P$ possibly empty subsets of the variables $X$, $X_i$ for $i = 1, \ldots, P$, such that $unknown(X_i)$ is a partition of $unknown(X)$. This induces a decomposition of the graph $G$ into $P$ subgraphs $G_i$ where:

•  the graph $G_i$ contains the nodes $X_i$ and any arcs and plates occurring on these nodes; and

•  the potential functions for cliques in $G_i$ are equivalent to those in $G$.

The induced decomposition represents the unique finest equivalent independence model to the original graph if and only if $X_i$ for $i = 1, \ldots, P$ is the finest collection of sets such that, when ignoring plates, for every unknown node $u$ in $X_i$, its Markov blanket is also in $X_i$. This finest decomposition takes $O(|X|^2)$ to compute. Furthermore, the evidence for $M$ now becomes a product over each subgraph,

$$evidence(M) = f_0 \prod_{i} f_i(known(X_{i,*})) , \quad (3)$$

for some functions $f_i$ (given in the proof).
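
The Markov-blanket condition in Theorem 1 can be sketched in a few lines of Python for the special case of a DAG without plates. This is a minimal illustration, not the author's implementation: the graph encoding (a dictionary mapping each node to its parents), the function names, and the example structure for model $M_1$ are all assumptions made here, since Figure 5 itself is not reproduced in the text.

```python
def markov_blanket(node, parents):
    """Parents, children, and the children's other parents of `node` in a DAG."""
    children = [c for c, ps in parents.items() if node in ps]
    blanket = set(parents.get(node, ()))
    for child in children:
        blanket.add(child)
        blanket.update(parents[child])
    blanket.discard(node)
    return blanket


def finest_decomposition(parents, unknown):
    """Group unknown nodes that are linked through their Markov blankets.

    Returns a list of node sets X_i. Each set holds one group of unknown
    nodes plus their Markov blankets, so known nodes may appear in several sets.
    """
    unknown = set(unknown)
    # Two unknown nodes are linked when one lies in the other's blanket.
    linked = {u: markov_blanket(u, parents) & unknown for u in unknown}
    seen, groups = set(), []
    for start in sorted(unknown):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            u = stack.pop()
            if u in component:
                continue
            component.add(u)
            stack.extend(linked[u] - component)
        seen |= component
        subset = set(component)
        for u in component:
            subset |= markov_blanket(u, parents)
        groups.append(subset)
    return groups


# Hypothetical encoding of model M1 (parameters are unknown, data are known).
M1_PARENTS = {
    "theta1": (), "theta2": (), "mu1": (), "sigma1": (),
    "var1": ("theta1",),
    "var2": ("theta2", "var1"),
    "x1": ("mu1", "sigma1", "var1"),
}
M1_UNKNOWN = {"theta1", "theta2", "mu1", "sigma1"}
```

Under this assumed structure, `finest_decomposition(M1_PARENTS, M1_UNKNOWN)` separates the parameters into the groups $\{\mu_1, \sigma_1\}$, $\{\theta_1\}$ and $\{\theta_2\}$, which matches the a posteriori independence described for model $M_1$.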


In some cases, the functions $f_i$ have a clean interpretation: they are equal to the evidence for the subgraphs. This result can be obtained from the following corollary.

Corollary 1.1 In the context of Theorem 1 where there are no deterministic nodes, suppose there exists a set of chain components $\tau_j$ from the graph ignoring plates such that $X_j = \tau_j \cup parents(\tau_j)$, where $unknown(parents(\tau_j)) = \emptyset$. Then

$$f_j(known(X_{j,*})) = p(known(\tau_j)_* \mid parents(\tau_j)_*, M) .$$

When deterministic nodes exist, this is altered by redefining the notion of parent [2].

If we denote the $j$-th subgraph by model $M_j$, then the probability term in the corollary is the conditional evidence for model $M_j$ given $parents(\tau_j)_*$. Denote by $M_0$ the subgraph on known variables induced by $cliques_0$ (as given in the proof [2]). If the condition of Corollary 1.1 holds for $M_i$ for $i = 0, 1, \ldots, P$, then it follows that the evidence for the model $M$ is equal to the product of the evidence for each subgraph,

$$evidence(M) = \prod_{i=0}^{P} evidence(M_i) . \quad (4)$$

This holds in general if the original graph $G$ is a DAG, as used in learning DAGs [4].

Corollary 1.2 Equation (4) holds if the parent graph $G$ is a DAG with plates.

In general, we might consider searching through a family of graphical models. To do this we can use standard methods such as local search or numerical optimization to find high posterior models, or Markov chain Monte Carlo methods to select a sample of representative models [2]. To do this, we first show how to represent a family of models. Figure 7, for instance, is similar to the models of Figure 5 except that some arcs are hatched. We use this to indicate that these arcs are optional. To instantiate a hatched arc, it can either be removed or replaced with a full arc. This graphical model then represents many different models, one for each of the $2^k$ possible instantiations of its $k$ hatched arcs. Prior probabilities for these models could be generated using a scheme such as in [4, p54] where a prior probability is assigned by a domain expert for the inclusion of each arc, and the prior for a full model found by multiplication. The family of models given by Figure 7 includes those of Figure 5 as instances. During search or sampling, an important property is the Bayes factor for the two models, Bayes-factor($M_2$, $M_1$). Because of the decompositions above, the Bayes factor can be found by only examining component Bayes factors for nodes whose parents have changed between models $M_1$ and $M_2$. The difference here is the model for the variable $x_1$.

$$\text{Bayes-factor}(M_2, M_1) = \frac{p(x_{1,*} \mid var_{1,*}, var_{2,*}, M_2)}{p(x_{1,*} \mid var_{1,*}, M_1)}$$

That is, the Bayes factor can be computed from only considering the models involving $\{\mu_1, \sigma_1\}$ and $\{\mu'_1, \sigma'_1\}$.

Figure 7: A family of models (optional arcs hatched)

This  incremental  modification  of  evidence,  Bayes  fac¬ 
tors,  and  finest  decompositions  is  also  general,  and  fol¬ 
lows  directly  from  the  independence  test.  It  has  been 
used  in  fast  learning  algorithms  for  DAGs  [4].  This  is 
developed  below  for  the  case  of  directed  arcs  and  non- 
deterministic  variables. 

Lemma 1 For a graph $G$ in the context of Theorem 1 with no deterministic nodes, we have two variables $U$ and $V$ such that $U$ is given. Consider adding/removing a directed arc from $U$ to $V$. We update the finest decomposition of $G$ as follows: There is a unique subgraph containing the unknown variables in $parents(chain\text{-}component(V))$. To this subgraph add/delete an arc from $U$ to $V$, and add/delete $U$ to the subgraph if required.

We  can  therefore  add  shaded  non-deterministic  par¬ 
ents  at  will  to  nodes  in  a  graph  and  the  finest  decom¬ 
position  remains  unchanged  except  for  a  few  additional 
arcs.  The  use  of  hatched  arcs  in  these  contexts  there¬ 
fore  causes  no  additional  trouble  to  the  decomposition 
process.  That  is,  we  form  the  finest  decomposition  for 
a  graph  with  plates  and  hatched  directed  arcs  as  if  the 
arcs  were  normal  directed  arcs,  and  the  evidence  is  ad¬ 
justed  during  the  search  by  adding  the  different  parents 
as  required. 


Bayes  factors  for  the  exponential  family 

The  above  results  are  useful,  but  to  make  use  of  them 
automatically  we  need  to  be  able  to  generate  Bayes  fac¬ 
tors  or  evidence  for  models.  It  generally  holds  that  if  a 
likelihood  is  in  the  exponential  family,  then  the  posterior 
distribution  for  the  model  parameters  is  also  in  the  expo¬ 
nential  family,  although  it  is  only  really  useful  when  the 
normalizing  constant  is  readily  computed.  This  holds 
for  the  Dirichlet,  the  conjugate  to  a  multinomial,  and 
the  Gaussian- Wishart,  the  conjugate  to  a  Gaussian.  We 
give  the  results  here. 

Let the normalizing constant for the conjugate distribution for Comment 1 be $Z_\theta(\tau)$, and let the normalizing constant for the distribution be $Z_1(\theta) Z_2$, where $Z_2$ is a constant part independent of $\theta$; then the Bayes factor can be readily computed. This is a common trick used widely by Bayesians; however, I have never seen it stated explicitly.

Lemma 2 Consider the context of Comment 1. Then the model likelihood or evidence, given by $evidence(M) = p(x_1, \ldots, x_N \mid y_1, \ldots, y_N, M)$, can be computed as:

$$evidence(M) = \frac{p(\theta \mid \tau) \prod_{j=1}^{N} p(x_j \mid y_j, \theta, M)}{p(\theta \mid \tau')} = \frac{Z_\theta(\tau')}{Z_\theta(\tau)\, Z_2^N} .$$
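
As a concrete instance of evidence as a ratio of normalizing constants, the sketch below treats a multinomial likelihood with its Dirichlet conjugate prior. It is an illustration under stated assumptions rather than the paper's code: the function names are hypothetical, the sample is taken as an ordered sequence of category labels so that the constant part $Z_2$ equals 1, and the updated hyperparameters are the prior parameters plus the observed counts.

```python
from math import exp, lgamma


def log_dirichlet_norm(alpha):
    """log Z(alpha), where Z(alpha) = prod_i Gamma(alpha_i) / Gamma(sum_i alpha_i)."""
    return sum(lgamma(a) for a in alpha) - lgamma(sum(alpha))


def log_evidence(counts, alpha):
    """log p(x_1, ..., x_N | M) for an ordered sample with the given category counts.

    The evidence is Z(alpha + counts) / Z(alpha); no multinomial coefficient
    appears because the order of the sample is kept (assumption: Z_2 = 1).
    """
    posterior = [a + n for a, n in zip(alpha, counts)]
    return log_dirichlet_norm(posterior) - log_dirichlet_norm(alpha)


def sequential_evidence(sample, alpha):
    """Same quantity, computed as a product of one-step-ahead predictive probabilities."""
    alpha = list(alpha)
    prob = 1.0
    for category in sample:
        prob *= alpha[category] / sum(alpha)
        alpha[category] += 1
    return prob
```

For the sample `[0, 1, 0]` with a uniform prior `(1, 1)`, both routes give 1/12, so the ratio of normalizing constants can be checked against the product of predictive probabilities.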

Abstract

In two-sample studies with ordinal responses, the
Wilcoxon rank-sum test is commonly chosen to test
equality of the distributions, $H_0 : F_1 = F_2$, in spite of
its  being  a  test  of  the  specific  hypothesis  of  location-shift 
between  the  distributions.  Unless  a  specific  alternative 
is  hypothesized,  use  of  an  omnibus  test  instead  should 
maximize  power.  We  compare  the  power  of  the  exact 
tests  based  on  the  omnibus  classical  Smirnov  statistic 
with  that  based  on  the  Wilcoxon  rank-sum  statistic  un¬ 
der  various  alternatives,  including  shift  in  location.  To 
compute  exact  power  we  use  the  methods  described  by 
Hilton  and  Mehta  (1993)  and  Mehta,  Patel  and  Tsiatis 
(1984).  These  algorithms  are  especially  useful  in  evalu¬ 
ating  the  Smirnov  test  because  its  asymptotic  non-null 
distribution  has  not  been  defined.  Specific  examples  as 
well  as  results  of  a  simulation  study  are  presented. 

Introduction 

When two-sample data with categorical responses are analyzed, if the responses are ordinal then the Wilcoxon rank-sum test (Wilcoxon, 1945) is commonly chosen to test equality of the distributions, $H_0 : F_1 = F_2$. The Wilcoxon rank-sum statistic is specifically sensitive to the hypothesis

$$H_0 : F_2(x) = F_1(x - \Delta),$$

where $\Delta$ represents a location-shift between the distributions of responses. However, especially in categorical data, little may be known about the types of differences that occur between distributions, in which case an omnibus test should generally increase power. For example, the responses may differ in scale, $\tau$, as well as in location, $\Delta$.

A candidate omnibus test for two-sample ordered categorical data is that based on the Smirnov statistic (Smirnov, 1939).

Attempting to obtain high power, Eplett (1982) proposed a statistic that is the sum of the Wilcoxon and Smirnov statistics and evaluated its power under location and scale alternatives. He showed that “for light-tailed distributions” his test becomes progressively more powerful compared with the Smirnov test as the scale-change part of the hypothesis becomes more dominant. When one of the two distributions was uniform over [0, 1], the power of the tests based on these two statistics was similar for $m = n = 50$. More recently, O’Brien (1988) and Blair and Morel (1992) evaluated four tests in the presence of location and scale changes in continuous data: Wilcoxon’s test, Student’s $t$ test, and O’Brien’s generalized versions of these tests (1988). The generalizations were defined to increase the sensitivity of the tests to scale changes. However, Blair and Morel (1992) found that “heterogeneity of patient response (scale change) does not always lead to power advantage for the unconditional generalized tests.” Thus, at least in the continuous response data realm, the need for a test that is sensitive to a broad range of alternatives has not yet been satisfied.

Here,  we  compare  the  power  of  the  Smirnov  test 
with  that  of  the  Wilcoxon  test,  the  standard  in  prac¬ 
tice,  under  alternatives  that  include  changes  in  location 
and/or  scale.  The  underlying  data  are  ordered  categor¬ 
ical.  Because  asymptotic  power  formulae  that  account 
for  categorical  data  do  not  exist,  we  evaluate  the  exact 
power  of  these  tests.  Mehta,  Patel  and  Tsiatis  (1984)  and 
Hilton  and  Mehta  (1993)  described  methods  for  testing 
or  finding  power  of  exact  tests  which  are  illustrated  via 
the Wilcoxon statistic. Hilton, Mehta and Patel (1994) and Nikiforov (1994) have recently reported algorithms for conducting exact Smirnov tests for continuous or categorical data. Methods are presented for computing exact power and for modeling alternatives of interest between the distributions of the two groups. Finally, we explore the relative power of the Wilcoxon and Smirnov tests against a range of these alternatives.


Methods 


Let $x = (x_1, \ldots, x_K)$, $\sum_j x_j = m$, and $x' = (x'_1, \ldots, x'_K)$, $\sum_j x'_j = n$, represent two samples of responses from multinomial distributions with parameters $(\pi_1, \ldots, \pi_K)$, $\sum_j \pi_j = 1$, and $(\pi'_1, \ldots, \pi'_K)$, $\sum_j \pi'_j = 1$, respectively. Denote the combined data by $t_j = x_j + x'_j$, $j = 1, \ldots, K$, where $K$ is the number of distinct categories in the combined sample. Then the probability of a particular permutation of the data, conditional on $t = (t_1, \ldots, t_K)$, is given by the generalized hypergeometric distribution (Lehmann, 1975),

$$\Pr\{X = x \mid t\} = \binom{m+n}{m}^{-1} \prod_{j=1}^{K} \frac{t_j!}{x_j!\,(t_j - x_j)!} ,$$

where $x \in \Gamma_t$, the set of all such permutations:

$$\Gamma_t = \Big\{ x : \sum_{j=1}^{K} x_j = m, \ \sum_{j=1}^{K} x'_j = n, \ x + x' = t \Big\} .$$
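
The reference set and the null conditional probabilities can be enumerated directly when the table is small. The following sketch is illustrative only; the function names are introduced here and the brute-force recursion is an assumption of convenience, not the network algorithm of the cited papers.

```python
from math import comb


def reference_set(t, m):
    """Yield every group-1 count vector x with 0 <= x_j <= t_j and sum(x) == m."""
    def rec(j, remaining, prefix):
        if j == len(t) - 1:
            if 0 <= remaining <= t[j]:
                yield prefix + (remaining,)
            return
        for xj in range(min(t[j], remaining) + 1):
            yield from rec(j + 1, remaining - xj, prefix + (xj,))
    yield from rec(0, m, ())


def null_probability(x, t):
    """Generalized hypergeometric probability of x given the margins t."""
    numerator = 1
    for xj, tj in zip(x, t):
        numerator *= comb(tj, xj)
    return numerator / comb(sum(t), sum(x))
```

Summing `null_probability(x, t)` over `reference_set(t, m)` returns 1, which is a quick check that the enumeration covers the whole conditional distribution.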

Under an alternative hypothesis $H_A$, $\pi'$ specifies a particular alternative of interest. Then the power of, say, the test based on the Wilcoxon statistic, is

$$\beta_t(w) = \Pr\{W \ge w \mid t; H_A\} = \sum_{x \in \Gamma_t(w)} \Pr\{X = x \mid t; H_A\}, \quad (1)$$

where $\Gamma_t(w) = \{x \in \Gamma_t : W(x) \ge w\}$ is the critical region of the Wilcoxon test. Similarly, the test could be based on the Smirnov statistic, in which case the critical region would be $\Gamma_t(s) = \{x \in \Gamma_t : \max_j [F'_n(j) - F_m(j)] \ge s\}$, where $F_m(j) = \sum_{i \le j} x_i / m$ is the empirical distribution function.
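
Both statistics are simple functions of the two count vectors. The sketch below assumes that $W$ is the rank sum of group 1 with midranks assigned within each category (the explicit form did not survive in the text), and that the one-sided Smirnov statistic is the largest value of $F'_n(j) - F_m(j)$; the function names are hypothetical.

```python
def wilcoxon_midrank(x, x_prime):
    """Rank sum of group 1, giving every response in a category its midrank."""
    w, below = 0.0, 0
    for xj, xpj in zip(x, x_prime):
        tj = xj + xpj
        # Ranks below+1 .. below+tj are shared, so each gets their average.
        w += xj * (below + (tj + 1) / 2)
        below += tj
    return w


def smirnov_one_sided(x, x_prime):
    """max_j [F'_n(j) - F_m(j)] computed from category counts."""
    m, n = sum(x), sum(x_prime)
    cum_x = cum_xp = 0
    best = float("-inf")
    for xj, xpj in zip(x, x_prime):
        cum_x += xj
        cum_xp += xpj
        best = max(best, cum_xp / n - cum_x / m)
    return best
```

For example, with `x = (0, 1, 3)` and `x_prime = (3, 1, 0)` group 1 sits in the higher categories, giving a rank sum of 25.5 and a Smirnov statistic of 0.75.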

Since power can be computed conditionally for all margins $t$ (1), exact unconditional power can be obtained as the expected value of these terms,

$$\beta(w) = \sum_{t \in \Omega} \beta_t(w) \Pr\{T = t; H_A\}, \quad (2)$$

where $\Omega = \{t : \sum_j t_j = m + n\}$ and $\Pr\{T = t; H_A\} = \sum_{x \in \Gamma_t} \Pr\{X = x, X' = x'; H_A\}$. In theory obtaining (2) is clearly not difficult, but in practice it is because the size of $\Omega$ can be quite large. For example, for $K = 5$ and $m + n = 50$, $\Omega$ contains 316,251 distinct vectors $t$.
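
The size of $\Omega$ quoted above follows from a standard stars-and-bars count of the non-negative integer vectors of length $K$ that sum to $m + n$. The short check below uses hypothetical function names and a brute-force count for a small case.

```python
from itertools import product
from math import comb


def margin_count(total, K):
    """Number of vectors t of K non-negative integers with sum(t) == total."""
    return comb(total + K - 1, K - 1)


def margin_count_brute(total, K):
    """Same count by explicit enumeration (only sensible for tiny cases)."""
    return sum(1 for t in product(range(total + 1), repeat=K) if sum(t) == total)
```

Here `margin_count(50, 5)` evaluates to 316251, in agreement with the figure in the text.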

To reduce the computational burden one can instead estimate exact power from a sample of $N$ margins $t_1, \ldots, t_N$ given $m$ and $n$. Hilton and Mehta (1993) described a Monte Carlo estimator of exact power,

$$\hat{\beta}(w) = \frac{1}{N} \sum_{i=1}^{N} \beta_{t_i}(w), \quad (3)$$

and reported its high efficiency relative to the usual Monte Carlo estimator when using the Wilcoxon statistic in 5-category data.
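
A rough sketch of this kind of estimator follows: margins are sampled under the alternative, the conditional power of the exact test is computed for each sampled margin by brute-force enumeration, and the results are averaged. Everything here is an assumption made for illustration (function names, the midrank form of the Wilcoxon statistic, rejection when the exact one-sided conditional p-value is at most $\alpha$); it is not the algorithm of Hilton and Mehta (1993) and is only practical for small tables.

```python
import random
from math import comb


def reference_set(t, m):
    """Yield every group-1 count vector x with 0 <= x_j <= t_j and sum(x) == m."""
    def rec(j, remaining, prefix):
        if j == len(t) - 1:
            if 0 <= remaining <= t[j]:
                yield prefix + (remaining,)
            return
        for xj in range(min(t[j], remaining) + 1):
            yield from rec(j + 1, remaining - xj, prefix + (xj,))
    yield from rec(0, m, ())


def wilcoxon_midrank(x, x_prime):
    """Rank sum of group 1 using midranks within categories (assumed form of W)."""
    w, below = 0.0, 0
    for xj, xpj in zip(x, x_prime):
        tj = xj + xpj
        w += xj * (below + (tj + 1) / 2)
        below += tj
    return w


def conditional_power(t, m, pi, pi_prime, alpha=0.05):
    """Power of the exact one-sided conditional Wilcoxon test given margins t."""
    stats, null_w, alt_w = [], [], []
    for x in reference_set(t, m):
        xp = tuple(tj - xj for tj, xj in zip(t, x))
        nw, aw = 1, 1.0
        for xj, xpj, tj, p, q in zip(x, xp, t, pi, pi_prime):
            c = comb(tj, xj)
            nw *= c                       # null weight
            aw *= c * p ** xj * q ** xpj  # alternative weight, up to a constant
        stats.append(wilcoxon_midrank(x, xp))
        null_w.append(nw)
        alt_w.append(aw)
    null_total, alt_total = sum(null_w), sum(alt_w)
    power = 0.0
    for s, aw in zip(stats, alt_w):
        p_value = sum(nw for s2, nw in zip(stats, null_w) if s2 >= s) / null_total
        if p_value <= alpha:
            power += aw
    return power / alt_total


def sample_margins(m, n, pi, pi_prime, rng):
    """Draw both multinomial samples and return the combined margins t."""
    t = [0] * len(pi)
    for j in rng.choices(range(len(pi)), weights=pi, k=m):
        t[j] += 1
    for j in rng.choices(range(len(pi)), weights=pi_prime, k=n):
        t[j] += 1
    return tuple(t)


def mc_power(m, n, pi, pi_prime, N, alpha=0.05, seed=0):
    """Average the conditional power over N sampled margins."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(N):
        t = sample_margins(m, n, pi, pi_prime, rng)
        total += conditional_power(t, m, pi, pi_prime, alpha)
    return total / N
```

When `pi` and `pi_prime` are equal, the estimate is the size of the test, which cannot exceed `alpha` because the conditional test is exact; this gives a simple sanity check on the sketch.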


Modeling  alternatives 


To account for the ordering of the responses, for group 1 define the cumulative probability of responding in categories 1 through $j$ as $\gamma_j = \pi_1 + \pi_2 + \cdots + \pi_j$ and define the corresponding cumulative number of subjects responding in categories 1 through $j$ as $m_j = m_{j-1} + x_j$, $j = 1, \ldots, K$, where $m_0 = 0$ and $m_K = m$. For group 2 define $\gamma'_j$ and $n_j$, $j = 0, \ldots, K$, analogously. The cumulative probabilities are useful in specifying the distributions under alternative hypotheses.

To  simplify  the  problem  of  specifying  alternatives 
in  ordered  categorical  data,  we  find  an  extension  of  the 
proportional  odds  model  (McCullagh,  1980)  useful: 


. 


(4) 


where $\Delta, \tau \in (-\infty, \infty)$, and $\Delta = 0$, $\tau = 0$ represents the null case. The model reduces the $2(K - 1)$ possible parameters to $K - 1$ nuisance parameters, $\gamma_j$, $j = 1, \ldots, K - 1$, and two parameters of interest, $\Delta$ and $\tau$. The nuisance parameters might represent, for example, the distribution of the control group whose values can be obtained from previous research.


Figure 1. Distributions arising from equation (4) for three combinations of $(\Delta, \tau)$ as a function of (a) $\pi = (.2, .2, .2, .2, .2)$ and (b) $\gamma = (.2, .4, .6, .8, 1.0)$. The horizontal axis of each panel is the response category (1 to 5).


Figure 1 illustrates some distributions that can arise from this model. Clearly, a rich field of alternatives can be specified through such a model, against some of which the Wilcoxon test may be more sensitive ($\Delta$ changes) and others the Smirnov test may be more sensitive ($\tau$ changes).

Example 

Lesaffre,  Scheys,  Frolich  and  Bluhmki  (1993)  described 
the  problem  of  calculating  sample  size  in  studies  with 
bounded  outcome  scores.  Their  responses  fell  into  21 
categories,  obtained  by  collapsing  a  continuous  0  -  100 
scale,  with  high  probabilities  in  the  first  and  last  cat¬ 
egories.  They  note  that  when  the  data  have  a  U- 
shaped  or  J-shaped  distribution,  the  assumptions  un¬ 
derlying  Lehmann’s  method  of  determining  power  via 
the  Wilcoxon  statistic  are  not  met.  This  method  indi¬ 
cates  that  120  subjects  per  group  are  needed  to  detect 
a  standardized  difference  of  .38  with  80%  power  using 
a  two-sided  .05-level  Wilcoxon  statistic;  we  add  that  95 
subjects  per  group  are  needed  using  a  one-sided  test. 

Their 21-point scale data are shown in Figure 2. Using the estimator described in (3) with $N = 5$, we estimated that the exact two-sided Wilcoxon power was 47.1 ± 1.1% and Smirnov power was 44.1 ± 1.1%, far less than the 80% obtained by the asymptotic approximation.


Figure 2. Distributions of Barthel Index scores (categorized) of control and treated subjects. (Modified from Lesaffre, Scheys, Frolich and Bluhmki (1993).)

Relative power

We compared more generally the unconditional power of the .05-level one-sided exact Wilcoxon and Smirnov tests against location ($\Delta$) and/or scale ($\tau$) alternatives in ordered categorical data. Group 1 data were generated by transforming $m = 50$ (or 25) uniform[0,1] random variates into 5-category multinomial($m$, $\pi$) counts, where $\pi = (.2, .2, .2, .2, .2)$ ($\gamma = (.2, .4, .6, .8, 1.0)$). Group 2 data with $n = 50$ (or 75) were generated similarly using (4) to specify $\gamma'$. We evaluated $\Delta = (0, .1(.2)1.3)$ by $\tau = (0, .2, .4)$. These alternatives lead to large values of the Wilcoxon and Smirnov statistics when testing $H_A : \gamma_j < \gamma'_j$, $j = 1, \ldots, K - 1$. For each combination of parameters, $N = 100$ two-sample data sets were simulated.

Figure 3 displays the relative power of the Smirnov to the Wilcoxon test in (a) balanced samples ($m = 50$, $n = 50$) and (b) unbalanced samples ($m = 25$, $n = 75$), as a function of $\Delta$, for three values of $\tau$. The relative power curves when $\tau = 0$ indicate that against location alternatives the Smirnov test was less powerful than the Wilcoxon test in both balanced and unbalanced samples. The relative power of the Smirnov test was lowest at $\Delta = 0$, which demonstrated that it was more conservative (Smirnov size = 3.4 ± .07%; Wilcoxon size = 5.0 ± .00%). As $\Delta$ moved away from zero, the relative power of the Smirnov test increased. At $\Delta = .9$, Wilcoxon power was 70% or 81%, depending on balance of samples, and Smirnov power was .84 to .86 times as high.

As $\tau$ moved away from the null case, both tests had greater power than in the $\tau = 0$ case at all values of $\Delta$. At $\Delta = .9$, $\tau = .2$, Wilcoxon power was 77% or 88%, and Smirnov power achieved .88 to .91 times these levels.

For $\tau = .4$, in the absence of location changes ($\Delta = 0$) the Smirnov test had substantially greater power. The relative power decreased with the increasing influence of the location parameter, but remained > .95 for all $\Delta$. At $\Delta = .9$, $\tau = .4$, Wilcoxon power was 84% to 92%, and Smirnov power was .98 to 1.02 times as high. As expected, both tests generally had greater power in balanced than imbalanced samples. In addition, balance in sample sizes favored the Smirnov test.

Conclusions 

The Wilcoxon rank-sum statistic is commonly used to analyze ordered categorical data. However, we believe that it should not be used without careful consideration of the alternative hypothesis of interest, since it was specifically designed to test a location-shift between distributions. We compared the power of exact tests based on the Wilcoxon rank-sum statistic and the Smirnov statistic, focusing on the setting that is optimal for the Wilcoxon since it is the “standard” for 2 x K ordered categorical data. We estimated the relative power of the tests against location-shifts, with and without inclusion of local scale alternatives, using the method of Hilton and Mehta (1993).

Figure 3. Relative power of Smirnov to Wilcoxon test, as a function of location ($\Delta$) and scale ($\tau$) changes between distributions, in (a) balanced and (b) unbalanced samples. Nuisance parameters are $\pi = (.2, .2, .2, .2, .2)$.

Under  the  null  hypothesis  of  no  difference  between 
two  distributions,  the  Smirnov  test  was  more  conserva¬ 
tive.  This  is  because  its  support,  or  reference  set,  is  more 
discrete  than  that  of  the  Wilcoxon  statistic. 

Against  location-shifts  alone,  the  setting  in  which 
the  Wilcoxon  test  is  optimal,  the  Wilcoxon  test  was 
substantially  more  powerful.  In  contrast,  the  Smirnov 
test  was  very  sensitive  to  scale  changes  alone,  while  the 
Wilcoxon  test  had  negligible  power  in  this  setting. 

For  location-shifts  in  the  presence  of  small  scale 
changes,  there  was  little  difference  in  the  power  of  these 
two  tests.  As  the  influence  of  the  scale  effect  increased, 
so  did  the  relative  power  of  the  Smirnov  test.  Without 
information  on  how  two  ordered  categorical  distributions 
differ,  the  test  based  on  the  Smirnov  statistic  is  the  safer 
choice.  Our  results  held  for  two  very  different  definitions 
of  the  nuisance  parameters  (only  one  shown  here)  and 
in  balanced  and  unbalanced  samples;  they  were  strong 
enough  to  suggest  that  they  hold  fairly  generally. 

The  choice  of  the  vector  of  nuisance  parameters, 
representing  a  control  group,  can  often  be  guided  by 
previous  research.  Specification  of  the  vector  of  param¬ 
eters  in  an  experimental  group  can  be  more  difficult;  we 
proposed  using  a  model,  such  as  the  proportional  odds 
model,  so  that  only  location  and/or  scale  differences  be¬ 
tween  the  distributions  need  be  specified.  The  choice  of 
which  model  to  use  is  somewhat  arbitrary.  Its  selection 
is akin to choosing to base a power calculation for 2 x 2
data on either the difference between binomial probabilities
or on the odds ratio. Like the influence of the
nuisance  parameters,  the  impact  of  the  model  on  the  al¬ 
ternatives  is  much  less  than  the  impact  of  the  location 
and/or  scale  parameters. 

In an example data set with a 21-point ordinal response
and large probabilities in the extreme categories,
we  showed  that  Lehmann’s  asymptotic  approximation 
can  provide  a  very  poor  estimate  of  the  power  of  the 
Wilcoxon  test  for  ordered  categorical  data.  We  don’t  at¬ 
tempt  to  generalize  this  finding,  but  rather  use  it  to  illus¬ 
trate  that  Lehmann’s  approximation  should  be  applied 
to  categorical  data  with  caution.  Another  drawback  of 
asymptotic  methods  for  power  calculations  is  that  the 
Wilcoxon  statistic  can  be  used  but  the  Smirnov  can¬ 
not  -  because  its  asymptotic  non-null  distribution  has 
not  been  defined.  In  contrast,  our  exact  method  accom¬ 
modates  a  variety  of  statistics,  including  the  Wilcoxon 
rank-sum and omnibus Smirnov statistics, and is most
efficient when samples are small and response distributions
are discrete.

In  conclusion,  if  two  ordered  categorical  distribu¬ 
tions  differ  at  all  in  scale,  with  or  without  differences 
in  location,  then  the  Smirnov  statistic  should  be  used 
for  designing  and  analyzing  a  study  of  the  hypothesis  of 
equality  of  distributions. 

A SIMULATION STUDY OF SOME RANK TESTS FOR INTERACTION
IN TWO-WAY LAYOUTS

In this investigation, two versions of the F-statistic analogue of the aligned rank test for interaction in two-way layouts are studied along with the classical F-test and the rank transform test. The Wilcoxon and the normal score functions are both considered in the study. The results from extensive simulation studies indicate that the aligned rank tests have better performance in general. None of these tests performs well for the Cauchy distribution. The use of the normal rank score function reduces the inflation in the Type I error rate of the rank transform statistic.

1  INTRODUCTION 

In  the  traditional  approach  for  testing  main  effects  in 
two-way  layouts,  the  existence  of  interaction  needs  to 
be  tested  first.  In  addition  to  the  classical  F-test,  sev¬ 
eral  nonparametric  alternatives  have  been  proposed  by 
Mehra  and  Sen  (1969),  Mehra  and  Smith  (1970),  Bhap- 
kar  and  Gore  (1974),  and  Mansouri  and  Govindarajulu 
(1990).  The  aligned  rank  methods  studied  by  McKean 
and  Hettmansperger  (1976),  Adichie  (1978),  and  Chi- 
ang  and  Puri  (1984)  can  also  be  used  to  form  tests  for 
interaction. 

The rank transform (RT) method consists of replacing the observations with their ranks among the entire data set and performing the standard parametric analysis of variance (ANOVA) test on these ranks. Conover and Iman (1981) have suggested that the RT approach can be applied in a variety of circumstances such as analysis of experimental designs, multiple regression, cluster analysis, and discriminant analysis. This type of testing procedure has several advantages over other procedures: less strict distributional assumptions, greater power, and simplicity of application, because existing computer software for parametric tests can be used. The simulation studies of Iman, Hora and Conover (1984) show that this procedure has excellent power properties for testing main effects in two-way layouts without interaction. Hora and Conover (1984) have also found the limiting null distribution of the usual F statistic when applied to ranks for testing main effects in two-way layouts without interaction. The simulation results of Blair, Sawilowsky and Higgins (1987) show that the RT statistic is inappropriate for testing interaction in two-way layouts. Under the assumption of normality and when both main effects are present, they found a severe inflation in Type I error rates when the main effects are large.

The RT technique is attractive because of its simplicity, since the classical F-test is available in almost every statistical package. In this paper, two statistics, the aligned RT statistic and the modified aligned RT statistic, that have the same features as the RT statistic are studied. They are more powerful and robust than the RT statistic and are referred to as the aligned rank transform statistics. Through the use of the alignment technique, the inflation in Type I error rates is overcome. The formulas for these aligned RT statistics are presented in section 2. In section 3, tables are generated for Type I error rates and power analysis. It is concluded that the aligned rank transform test is the most robust test among the group of tests. None of these tests performs well for the Cauchy distribution.

2  THE  TEST  STATISTICS 

Let $X_{ijk}$ denote the $k$th random observation from the $(i, j)$th cell, following the fixed effects model:

$$X_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + \epsilon_{ijk}, \quad (2.1)$$

$$i = 1, \ldots, r, \quad j = 1, \ldots, c, \quad k = 1, \ldots, n,$$

where $rcn = N$, $\mu$ is the overall mean, $\alpha_i$ and $\beta_j$ are the $i$th row and the $j$th column main effects, respectively, $\gamma_{ij}$ is the interaction between the $i$th row and $j$th column, and $\epsilon_{ijk}$ are independent and identically distributed random variables having a continuous distribution function $F(\cdot)$. We wish to test the null hypothesis

$$H_0 : \gamma_{ij} = 0, \ \forall\, i, j,$$

against the alternative hypothesis

$$H_A : \gamma_{ij} \ne 0, \text{ for at least one } (i, j).$$
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The  classical  F-statistic  is  given  by 


where 


F  = 


MSINT 
MSB  ’ 


where 


(2.2) 


MSINT  =  •)/(r-l)(c-l), 

.=1  i=i 

(2.3) 


MSB  =  S  £  -  Xij.f/rc{n  -  1),  (2.4) 

»=li=l *=1 


n  N 

aN{Rii  )  -  aN(Rijk)/n,  on  =  «w(«)M 

*=1  a=l 

and  MSE(a)  is  obtained  from  (2)  by  replacing  Xijk  with 
aN{Rij]c).  It  can  be  shown  that  under  Hq  the  limiting 
distributions  of  (r  -  l)(c  -  l)jPa  and  (r  -  l)(c  -  1)F,„ 
are  central  with  (r  ~  l)(c  -  1)  degrees  of  freedom, 
Mansouri  and  Chang  (1994). 

3  SIMULATION  STUDY 


the  combination  of  bar  and  dot  notation  means  that  the 
values  are  averaged  over  the  subscript (s). 

Let  Rijk  denote  the  rank  of  Xijk  among  Xm,-*, 
^^fcn  •  I'he  R/TP  statistic  F  is  obtained  froin 
(2.2)  by  replacing  Xijk  with  the  ranks  or  the  rank  scores 
0'N{Rijk)-  We  consider  the  rank  score  functions  that 
satisfy  the  following  general  assumptions: 

•  The  scores  U;s^(i)  are  generated  by  a  non-constant 
and  square  integrable  function  <j>  defined  on  (0, 1), 
in  one  of  the  following  ways: 

®Ar(0  ^  ^  ~ 

where  1  <  i  <  iV,  and  Ui^N  is  the  ith  order  statistic 
in  a  sample  of  size  N  from  a  uniform  distribution 
defined  on  (0, 1). 

•  The  score  generating  function  ^(u)  on  (0, 1)  is  such 
that 

0  <  =  f  [<^(u)— <^)^du  <  oo,  =  f  <f)(u)du. 

Jo  Jo 

Let  and  denote  some  consistent  esti¬ 
mators  of  and  {PjYj-i  under  Hq,  respectively, 

such  that  -  a,)  and  -  /?,)  are  bounded 

in  probability  for  every  i  and  j.  Let  the  aligned  rank 
Rijk  be  the  rank  of  Xijk  —  ~  among  Xm  —  di  — 

•  •  • , Xrcn  -  &r  -  0c ‘  The  the  aligned  RT  statistic  Fa 
has  the  form  (2.2)  by  replacing  Xijk  with  ajv(^,jjk). 

The  modified  aligned  RT  is  the  F-statistic  version  of 
the  aligned  rank  test  suggested  by  Mansouri  and  Govin- 
darajulu  (1990).  The  statistic  for  the  modified  aligned 
RT  test  is 

r  c 

)-^vv]V[('’-l)(c-l)M5F(a)], 

i=l  i  =  l 


In the Monte Carlo simulation studies, we present results
from the design with r = 4 rows and c = 3 columns.
Each design is replicated 5000 times to ensure the stability
of the simulated sampling distributions of the statistics
that are considered. The least-squares estimators
are used for $\hat\alpha_i$ and $\hat\beta_j$ in the aligned RT tests.
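
The construction of $F$, $F_r$ and $F_a$ can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the code used for the study: it assumes a balanced layout stored as nested lists `x[i][j][k]`, Wilcoxon scores $a_N(i) = i$, and least-squares alignment by row and column means. The function names (`interaction_f`, `rt_f`, `aligned_rt_f`) are hypothetical.

```python
def cell_stats(y):
    """Return (r, c, n, cell means, row means, column means, grand mean)."""
    r, c, n = len(y), len(y[0]), len(y[0][0])
    cell = [[sum(y[i][j]) / n for j in range(c)] for i in range(r)]
    row = [sum(cell[i]) / c for i in range(r)]
    col = [sum(cell[i][j] for i in range(r)) / r for j in range(c)]
    grand = sum(row) / r
    return r, c, n, cell, row, col, grand


def interaction_f(y):
    """Classical F = MSINT / MSE for a balanced layout y[i][j][k]."""
    r, c, n, cell, row, col, grand = cell_stats(y)
    ss_int = n * sum((cell[i][j] - row[i] - col[j] + grand) ** 2
                     for i in range(r) for j in range(c))
    ss_err = sum((y[i][j][k] - cell[i][j]) ** 2
                 for i in range(r) for j in range(c) for k in range(n))
    ms_int = ss_int / ((r - 1) * (c - 1))
    ms_err = ss_err / (r * c * (n - 1))
    return ms_int / ms_err


def midranks(values):
    """Ranks 1..N of a flat list, with tied values given their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        for t in range(pos, end + 1):
            ranks[order[t]] = (pos + end) / 2 + 1
        pos = end + 1
    return ranks


def rank_layout(y):
    """Replace every observation by its rank among all N = rcn observations."""
    r, c, n = len(y), len(y[0]), len(y[0][0])
    flat = [y[i][j][k] for i in range(r) for j in range(c) for k in range(n)]
    ranks = midranks(flat)
    return [[[ranks[(i * c + j) * n + k] for k in range(n)]
             for j in range(c)] for i in range(r)]


def rt_f(x):
    """RT statistic F_r: the F-statistic computed on the ranks."""
    return interaction_f(rank_layout(x))


def aligned_rt_f(x):
    """Aligned RT statistic F_a: subtract least-squares main effects, then rank."""
    r, c, n, _, row, col, grand = cell_stats(x)
    aligned = [[[x[i][j][k] - (row[i] - grand) - (col[j] - grand)
                 for k in range(n)] for j in range(c)] for i in range(r)]
    return interaction_f(rank_layout(aligned))
```

Because the alignment removes additive row and column effects before ranking, `aligned_rt_f` is unchanged when arbitrary main effects are added to the data, whereas `rt_f` is not.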

Table 1 contains the empirical Type I error rates versus
the nominal $\alpha = .05$ for a normal underlying distribution
with the main effects $\alpha_2 = \beta_1 = c$ and
$\alpha_3 = \beta_2 = -c$, and all other effects equal to zero. The
Wilcoxon rank score function is used. As in Blair et al.
(1987), a severe inflation in the Type I error rates is observed
for the RT statistic $F_r$ as $c$ increases, whereas
the Type I error rates of $F_a$ and $F_m$ stay inside the 95%
confidence bounds and behave well.
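
The text does not say how the 95% confidence bounds were computed. As a rough guide, and assuming the usual normal approximation to a binomial proportion with 5000 replications at the nominal level, the bounds can be sketched as follows (the function name `type1_bounds` is hypothetical).

```python
from math import sqrt


def type1_bounds(alpha=0.05, replications=5000, z=1.96):
    """Normal-approximation bounds for an empirical rejection rate
    when the true rate equals the nominal level alpha."""
    half_width = z * sqrt(alpha * (1 - alpha) / replications)
    return alpha - half_width, alpha + half_width


low, high = type1_bounds()
print(f"{low:.4f} {high:.4f}")  # prints "0.0440 0.0560"
```

Under this assumption, an empirical Type I error rate between roughly .044 and .056 is compatible with the nominal level.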

The empirical power of these statistics in the presence of
interaction effects is presented in Table 2. For cases where the
empirical Type I error rates are not much larger than the
nominal values, we can see that $F$, $F_a$, and $F_m$ have better
empirical power than $F_r$.

The Type I error rates and power of these tests are also
simulated under different distributions to examine their
robustness properties. Tables 3 and 4 contain results
for the exponential (EXP), double exponential (DEX) and
Cauchy (CAU) distributions. None of these tests performs
well for the Cauchy distribution. As both of the main effects
increase, the inflation in Type I error rates of the RT
statistic is observed for every distribution. However, it
performs better than the other tests when the underlying
distribution is Cauchy. The aligned RT statistic has better
performance in general. The other rank tests seem
to have higher empirical power, but this is due to their
inflated Type I error rates.

In Table 5, the Type I error rates and power of the $\chi^2$
versions of the aligned RT tests, $A = (r-1)(c-1)F_a$ and
$M = (r-1)(c-1)F_m$, are also presented. Comparisons
can be made with the results for the F-statistics in Tables
3 and 4. The empirical Type I error rates of the F-statistics
converge faster than those of their $\chi^2$ counterparts. The
empirical power of the F-statistic is slightly less than that of
the $\chi^2$-statistic.

When the normal scores are used and the underlying
distribution is normal, the aligned RT tests perform as
well as $F$ in terms of empirical Type I error and power
(Mansouri and Chang, 1994). Furthermore, the inflation
of the empirical Type I error for the RT test is significantly
reduced.


                 Sample size (n), $\alpha = .05$
 c     Stat.      5       10      20      50
0.50   F        .050    .053    .051    .052
       Fr       .055    .052    .053    .057
       Fa       .053    .052    .053    .050
       Fm       .056    .058    .058    .055
1.00   F        .048    .047    .047    .047
       Fr       .057    .069    .093    .189
       Fa       .051    .051    .051    .045
       Fm       .055    .056    .056    .051
1.50   F        .047    .050    .054    .053
       Fr       .071    .136    .314    .848
       Fa       .050    .045    .052    .053
       Fm       .056    .050    .055    .058
2.50   F        .047    .051    .049    .056
       Fr       .192    .684    .995   1.000
       Fa       .052    .051    .048    .055
       Fm       .056    .057    .053    .060

Table 1: Type I Error Rates of Tests for Interactions
when $\alpha_2 = \beta_1 = c$, $\alpha_3 = \beta_2 = -c$.

 c     Stat.     EXP     DEX     CAU
0.50   F        .052    .049    .017
       Fr       .086    .048    .053
       Fa       .055    .047    .463
       Fm       .212    .081    .999
1.00   F        .052    .054    .016
       Fr       .399    .113    .059
       Fa       .053    .052    .468
       Fm       .215    .084   1.000
1.50   F        .045    .045    .012
       Fr       .924    .368    .093
       Fa       .052    .049    .458
       Fm       .197    .079    .999
2.50   F        .043    .047    .018
       Fr      1.000    .994    .261
       Fa       .046    .047    .472
       Fm       .198    .073    .999

Table 3: Type I Error Rates of Tests for Interactions
when $\alpha_2 = \beta_1 = c$, $\alpha_3 = \beta_2 = -c$, with $\alpha = .05$, sample
size n = 50.


 c     Stat.     EXP     DEX     CAU
0.50   F        .449    .540    .018
       Fr       .997    .679    .311
       Fa       .999    .727    .564
       Fm      1.000    .780    .999
1.00   F       1.000    .996    .024
       Fr      1.000    .999    .823
       Fa      1.000   1.000    .806
       Fm      1.000   1.000   1.000
1.50   F       1.000   1.000    .028
       Fr      1.000   1.000    .980
       Fa      1.000   1.000    .954
       Fm      1.000   1.000   1.000

Table 4: Power analysis of Tests for Interactions when
$\gamma_{11} = \gamma_{41} = \ldots$ [remainder of the condition illegible in
the source], with $\alpha = .05$, sample size n = 50.

                 Sample size (n), $\alpha = .05$
 c     Stat.      5       10      20      50
* Type I Error
0.50   A        .065    .060    .056    .051
       M        .072    .066    .062    .056
1.00   A        .063    .057    .054    .046
       M        .069    .063    .059    .052
* Power Analysis
0.50   A        .138    .221    .425    .867
       M        .150    .234    .440    .874
1.00   A        .421    .755    .985   1.000
       M        .440    .770    .985   1.000

Table 5: Type I Error and Power of A and M.

Abstract


Wilcoxon's signed rank sum test, Wilcoxon's rank sum
test and Ansari-Bradley's rank sum test are three well
known distribution-free tests. When the sample size
is large enough, the lower tail probabilities $P_0[T_n \le x]$,
$P_0[W_{m,n} \le x]$ and $P_0[A_{m,n} \le x]$ may be easily computed,
under $H_0$, using some normal approximations.
When the size of the samples is too small, these normal
approximations unfortunately become inadequate.
So the main goal of our work is to find some fast algorithms
which compute the exact lower tail probabilities
$P_0[T_n \le x]$, $P_0[W_{m,n} \le x]$ and $P_0[A_{m,n} \le x]$ when the
normal approximation is inadequate.


1  Introduction 


Wilcoxon's signed rank sum test, Wilcoxon's rank sum
test [2] and Ansari-Bradley's rank sum test [1] are three
well known distribution-free tests. The first two may
be used to investigate the presence of a shift in location
between two populations, whereas the last one may be
used to investigate the presence of a difference in scale
between two populations having unknown but equal medians.
These tests are based respectively on Wilcoxon's
$T_n$ statistic, Wilcoxon's $W_{m,n}$ statistic (which is closely
related to Mann-Whitney's $U_{m,n}$ statistic [3]), and on
Ansari-Bradley's $A_{m,n}$ statistic.

When the sample size is large enough, the
lower tail probabilities $P_0[T_n \le x]$, $P_0[W_{m,n} \le x]$ and
$P_0[A_{m,n} \le x]$ may be easily computed, under $H_0 : \theta = 0$
(no shift in location) and $H_0 : \gamma^2 = 1$ (no difference
in scale), using some normal approximations. For a full
description of these approximations see [4], pages 28,
68-69, 85.

When the size of the samples is too small (i.e. n < 15
for Wilcoxon's $T_n$ statistic and m + n < 20 for the
others), these normal approximations unfortunately become
inadequate. So the main goal of our work is
to find some fast algorithms which compute the exact
lower tail probabilities $P_0[T_n \le x]$, $P_0[W_{m,n} \le x]$ and
$P_0[A_{m,n} \le x]$ when the normal approximation is
inadequate.


2 Wilcoxon's $T_n$ statistic


The $T_n$ statistic can only take values between 0 and
$n(n+1)/2$. So we can deduce a first elementary relation

$$P_0[T_n \le x] = \begin{cases} 0 & \text{if } x < 0 \\ 1 & \text{if } x \ge n(n+1)/2 \end{cases}$$

The $T_n$ distribution is, under $H_0 : \theta = 0$, symmetric
about $n(n+1)/4$. It is thus possible to compute
$P_0[T_n \le x]$ using a value which is always smaller than
$n(n+1)/4$. Therefore, when $x > n(n+1)/4$, we can
apply a second elementary relation

$$P_0[T_n \le x] = 1 - P_0\left[T_n \le \frac{n(n+1)}{2} - x - 1\right]$$

The lower tail probability $P_0[T_n \le x]$ may be computed by
counting the number $T_n^{\le x}$ of $k$-tuples, $k \in \{0, 1, \ldots, n\}$,
among $\{1, 2, \ldots, n\}$ having a sum less than or equal to $x$ and
by dividing it by $C_n^0 + C_n^1 + \cdots + C_n^n = 2^n$

$$P_0[T_n \le x] = \frac{T_n^{\le x}}{2^n}$$


Now let us define $T_{n,k}$ to be the set of all the $k$-tuples
among $\{1, 2, \ldots, n\}$ and $T_{n,k}^{\le x}$ to be the number of
$k$-tuples of $T_{n,k}$ having a sum less than or equal to $x$. With


456  Computation  of  Wilcoxon  ’s  and  Ansari-Bradley  ’s  Statistics 


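these definitions, it is clear that

$$T_n^{\le x} = \sum_{k=0}^{n} T_{n,k}^{\le x}$$

Let us define $T_{n,k}^{\min}$ and $T_{n,k}^{\max}$ to be respectively the
minimal and maximal sums of the elements of all $k$-tuples
of $T_{n,k}$. If we now define $k_1 \ge 0$ to be the largest integer
verifying $x \ge T_{n,k_1}^{\max}$ and $k_2 > k_1$ to be the smallest integer
verifying $x < T_{n,k_2}^{\min}$, then we have

$$T_n^{\le x} = \sum_{k=0}^{k_1} C_n^k + \sum_{k=k_1+1}^{k_2-1} T_{n,k}^{\le x}$$

The values $T_{n,k}^{\min}$ and $T_{n,k}^{\max}$ may be computed using the
following recurrence relations

$$T_{n,0}^{\min} = 0, \qquad T_{n,0}^{\max} = 0$$

$$T_{n,k}^{\min} = T_{n,k-1}^{\min} + k, \qquad T_{n,k}^{\max} = T_{n,k-1}^{\max} + n - k + 1$$

We can show that the value $T_{n,k}^{\le x}$ may be computed
using the following relation

$$T_{n,k}^{\le x} = \sum_{i=1}^{n-k+1} T_{n-i,\,k-1}^{\le x - ik}$$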
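
The relations of this section can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' C implementation: `count_k_subsets` and `signed_rank_cdf` are hypothetical names, $x$ is assumed to be an integer, the minimal and maximal sums are taken in closed form rather than by recurrence, and the early returns play the role of the bounds $k_1$ and $k_2$.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_k_subsets(n, k, x):
    """Number of k-subsets of {1, ..., n} whose sum is <= x."""
    if k == 0:
        return 1 if x >= 0 else 0
    t_min = k * (k + 1) // 2          # sum of the k smallest integers
    t_max = k * (2 * n - k + 1) // 2  # sum of the k largest integers
    if x < t_min:
        return 0
    if x >= t_max:
        return comb(n, k)
    # Condition on the smallest element i and shift the remaining
    # k - 1 elements down by i.
    return sum(count_k_subsets(n - i, k - 1, x - i * k)
               for i in range(1, n - k + 2))


def signed_rank_cdf(n, x):
    """Exact lower tail probability P0[T_n <= x] for integer x."""
    total = n * (n + 1) // 2
    if x < 0:
        return 0.0
    if x >= total:
        return 1.0
    if 2 * x > total:  # symmetry about n(n+1)/4
        return 1.0 - signed_rank_cdf(n, total - x - 1)
    count = sum(count_k_subsets(n, k, x) for k in range(n + 1))
    return count / 2 ** n


print(signed_rank_cdf(4, 3))  # 5 of the 16 subsets of {1,2,3,4} have sum <= 3
```

Memoising the recursion keeps the number of distinct subproblems small, which is what makes the exact computation practical for the small sample sizes where the normal approximation is not recommended.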


3 Wilcoxon's $W_{m,n}$ statistic


The $W_{m,n}$ statistic can only take values between
$m(m+1)/2$ (i.e. $X_1, X_2, \ldots, X_m$ are all smaller than
$Y_1, Y_2, \ldots, Y_n$) and $m(m+2n+1)/2$ (i.e. $X_1, X_2, \ldots, X_m$
are all greater than $Y_1, Y_2, \ldots, Y_n$). Thus,

$$P_0[W_{m,n} \le x] = \begin{cases} 0 & \text{if } x < m(m+1)/2 \\ 1 & \text{if } x \ge m(m+2n+1)/2 \end{cases}$$

Under $H_0 : \theta = 0$, $W_{m,n}$ has a symmetric distribution
about $m(m+n+1)/2$. Hence we can deduce a relation
which is useful when $2x > m(m+n+1)$

$$P_0[W_{m,n} \le x] = 1 - P_0[W_{m,n} \le m(m+n+1) - x - 1]$$

When $m$ and $n$ are exchanged, the distribution of
$W_{m,n}$ under $H_0 : \theta = 0$ is just shifted by $n(n+1)/2 - m(m+1)/2$.
Then we can always suppose that $m \le n$,
and if it is not the case the following rule may be used

$$P_0[W_{m,n} \le x] = P_0\left[W_{n,m} \le x - \frac{m(m+1)}{2} + \frac{n(n+1)}{2}\right]$$

Let us define $W_{m,n}^{\le x}$ and $W_{m,n}^{=x}$ to be the number of
$m$-tuples among $\{1, 2, \ldots, m+n\}$ having a sum respectively
less than or equal to $x$ and equal to $x$. Under $H_0 : \theta = 0$, the
lower tail probability $P_0[W_{m,n} \le x]$ is determined by the
ratio

$$P_0[W_{m,n} \le x] = \frac{W_{m,n}^{\le x}}{C_{m+n}^{m}}$$

Let us also define $W_{m,n,k}$, $k \in \{1, 2, \ldots, n+1\}$, to be the
set containing all the $m$-tuples among $\{1, 2, \ldots, m+n\}$
beginning with $k$, and $W_{m,n,k}^{\min}$ and $W_{m,n,k}^{\max}$ to be respectively
the minimal and maximal sums of the elements of all
$m$-tuples of $W_{m,n,k}$. These values may be computed using
the following recurrence relations

$$W_{m,n,1}^{\min} = m(m+1)/2, \qquad W_{m,n,1}^{\max} = m(m+2n+1)/2 - n$$

$$W_{m,n,k}^{\min} = W_{m,n,k-1}^{\min} + m, \qquad W_{m,n,k}^{\max} = W_{m,n,k-1}^{\max} + 1$$

We also introduce $W_{m,n,k}^{\le x}$ to be the number of $m$-tuples of
$W_{m,n,k}$ having a sum less than or equal to $x$. By definition,
we have

$$W_{m,n}^{\le x} = \sum_{k=1}^{n+1} W_{m,n,k}^{\le x}$$

By a similar reasoning, if we define $k_1 \ge 1$ to be the
largest integer verifying $x \ge W_{m,n,k_1}^{\max}$ and $k_2 > k_1$ to be
the smallest integer verifying $x < W_{m,n,k_2}^{\min}$, then we have

$$W_{m,n}^{\le x} = \sum_{k=1}^{k_1} C_{m+n-k}^{m-1} + \sum_{k=k_1+1}^{k_2-1} W_{m,n,k}^{\le x}$$

We can show that $W_{m,n,k}^{\le x}$ may be recursively computed
using the following simple relation

$$W_{m,n,k}^{\le x} = W_{m-1,\,n-k+1}^{\le x - mk}$$

and then

$$W_{m,n}^{\le x} = \sum_{k=1}^{k_1} C_{m+n-k}^{m-1} + \sum_{k=k_1+1}^{k_2-1} W_{m-1,\,n-k+1}^{\le x - mk}$$

The probability $P_0[W_{m,n} = x]$ may be computed
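by counting the number $W_{m,n}^{=x}$ of $m$-tuples among
$\{1, 2, \ldots, m+n\}$ having a sum equal to $x$ and by
dividing it by $C_{m+n}^{m}$

$$P_0[W_{m,n} = x] = \frac{W_{m,n}^{=x}}{C_{m+n}^{m}}$$

By a similar reasoning (and with the same notations),
we get

$$W_{m,n}^{=x} = \sum_{k=1}^{k_2-1} W_{m-1,\,n-k+1}^{=x - mk}$$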
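
The recursion for $W_{m,n}^{\le x}$ can be sketched in the same way. Again this is only an illustrative Python sketch, not the authors' C code: the names `count_le` and `rank_sum_cdf` are hypothetical, $x$ is assumed to be an integer, and the range checks inside `count_le` stand in for the bounds $k_1$ and $k_2$.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_le(m, n, x):
    """Number of m-subsets of {1, ..., m+n} whose sum is <= x."""
    if m == 0:
        return 1 if x >= 0 else 0
    if x < m * (m + 1) // 2:
        return 0
    if x >= m * (m + 2 * n + 1) // 2:
        return comb(m + n, m)
    # Subsets whose smallest element is k correspond, after shifting by k,
    # to (m-1)-subsets of {1, ..., m+n-k} with sum <= x - m*k.
    return sum(count_le(m - 1, n - k + 1, x - m * k)
               for k in range(1, n + 2))


def rank_sum_cdf(m, n, x):
    """Exact lower tail probability P0[W_{m,n} <= x] for integer x."""
    if m > n:  # exchange rule: shift by n(n+1)/2 - m(m+1)/2
        return rank_sum_cdf(n, m, x - m * (m + 1) // 2 + n * (n + 1) // 2)
    if x < m * (m + 1) // 2:
        return 0.0
    if x >= m * (m + 2 * n + 1) // 2:
        return 1.0
    if 2 * x > m * (m + n + 1):  # symmetry about m(m+n+1)/2
        return 1.0 - rank_sum_cdf(m, n, m * (m + n + 1) - x - 1)
    return count_le(m, n, x) / comb(m + n, m)


print(rank_sum_cdf(3, 1, 7))  # 0.5: the sums 6, 7, 8, 9 are equally likely
```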


4 Ansari-Bradley's $A_{m,n}$ statistic

Let $A_{m,n}^{\min}$ and $A_{m,n}^{\max}$ be respectively the minimal and maximal
values that an $A_{m,n}$ statistic can take, such that

$$P_0[A_{m,n} \le x] = \begin{cases} 0 & \text{if } x < A_{m,n}^{\min} \\ 1 & \text{if } x \ge A_{m,n}^{\max} \end{cases}$$

These values depend on the parity of $m$ and $n$. We can
show that if $m$ is even then

$$A_{m,n}^{\min} = \frac{m(m+2)}{4}, \qquad A_{m,n}^{\max} = \frac{m(m+2n+2)}{4}$$

and if $m$ is odd then

$$A_{m,n}^{\min} = \frac{(m+1)^2}{4}, \qquad A_{m,n}^{\max} = \begin{cases} \dfrac{m(m+2n+2)+1}{4} & \text{if } n \text{ is even} \\[2ex] \dfrac{m(m+2n+2)-1}{4} & \text{if } n \text{ is odd} \end{cases}$$

Under $H_0 : \gamma^2 = 1$, $A_{m,n}$ has a symmetric distribution
about its mean $m(m+n+2)/4$ only if $m+n$ is even.
In this case we can apply the following relation when
$4x > m(m+n+2)$

$$P_0[A_{m,n} \le x] = 1 - P_0\left[A_{m,n} \le \frac{m(m+n+2)}{2} - x - 1\right]$$

When $m$ and $n$ are exchanged, the distribution of
$A_{m,n}$ under $H_0 : \gamma^2 = 1$ is shifted and inverted such
that

$$P_0[A_{m,n} \le x] = 1 - P_0\left[A_{n,m} \le A_{m,n}^{\min} + A_{n,m}^{\max} - x - 1\right]$$

Thus, if $m+n$ is even

$$P_0[A_{m,n} \le x] = 1 - P_0\left[A_{n,m} \le \frac{(m+n)(m+n+2)}{4} - x - 1\right]$$

and if $m+n$ is odd then

$$P_0[A_{m,n} \le x] = 1 - P_0\left[A_{n,m} \le \frac{(m+n+1)^2}{4} - x - 1\right]$$

We can always suppose that $m \le n$, and if it is not
the case we have to apply one of the previous relations.

Let us define $A_{m,n}^{\le x}$ to be the number of $m$-tuples, defined
by Ansari-Bradley's rule, having a sum less than or equal
to $x$. Under $H_0 : \gamma^2 = 1$, the lower tail probability
$P_0[A_{m,n} \le x]$ is determined by the following ratio

$$P_0[A_{m,n} \le x] = \frac{A_{m,n}^{\le x}}{C_{m+n}^{m}}$$

If we define $s \ge m$ to be

$$s = \begin{cases} \dfrac{m+n}{2} & \text{if } m+n \text{ is even} \\[2ex] \dfrac{m+n-1}{2} & \text{if } m+n \text{ is odd} \end{cases}$$

then we can also define $A_{m,n,k}$, $k \in \{0, 1, \ldots, m\}$, to
be the subsets containing all the $m$-tuples, defined by
Ansari-Bradley's rule, having $k$ elements among the $s$
first ones. For each subset $A_{m,n,k}$, we define $A_{m,n,k}^{\le x}$ to
be the number of $m$-tuples of $A_{m,n,k}$ having a sum less than
or equal to $x$. Hence we have

$$A_{m,n}^{\le x} = \sum_{k=0}^{m} A_{m,n,k}^{\le x}$$

But for $k = 0$ we have $A_{m,n,0}^{\le x} = W_{m,\,n-s}^{\le x}$, and for $k = m$
we have $A_{m,n,m}^{\le x} = W_{m,\,s-m}^{\le x}$. So the previous sum may be
replaced by

$$A_{m,n}^{\le x} = W_{m,\,n-s}^{\le x} + \sum_{k=1}^{m-1} A_{m,n,k}^{\le x} + W_{m,\,s-m}^{\le x}$$

The $s$ first elements of subset $A_{m,n,k}$ contain all the
$k$-tuples of $k$ integers among $\{1, 2, \ldots, s\}$. Let $\alpha_{s,k}$ and $\beta_{s,k}$
be respectively the minimal and maximal sums of these $s$
first elements. We can easily show that these values may
be computed using the following recurrence relations

$$\alpha_{s,0} = 0, \qquad \beta_{s,0} = 0$$

$$\alpha_{s,k} = \alpha_{s,k-1} + k, \qquad \beta_{s,k} = \beta_{s,k-1} + s - k + 1$$

If we denote (for clarity) $j_1 = \alpha_{s,k}$ and $j_2 = \beta_{s,k}$, we
can show that

$$A_{m,n,k}^{\le x} = \sum_{j=j_1}^{j_2} W_{k,\,s-k}^{=j}\; W_{m-k,\,n-s+k}^{\le x-j}$$

and finally

$$A_{m,n}^{\le x} = W_{m,\,n-s}^{\le x} + \sum_{k=1}^{m-1} \sum_{j=j_1}^{j_2} W_{k,\,s-k}^{=j}\; W_{m-k,\,n-s+k}^{\le x-j} + W_{m,\,s-m}^{\le x}$$


5 Computer performances and
Conclusions

The algorithms for computing Wilcoxon's and
Ansari-Bradley's statistics have been written in C,
compiled with the GNU GCC compiler and
tested on a SUN IPC. The average time for computing
$P_0[T_n \le x]$ is approximately 0.05 seconds for n = 20,
and the average time for computing $P_0[W_{m,n} \le x]$ and
$P_0[A_{m,n} \le x]$ for each combination of m + n = 20 is also
approximately 0.05 seconds. These results show clearly
that the average response time of the three proposed
algorithms is very small, and thus statistical tables become
unnecessary.

This paper also shows that Ansari-Bradley's
$A_{m,n}$ statistic may be computed in terms of Wilcoxon's
$W_{m,n}$ statistic. This reduces the size of the algorithms
themselves.

References

[2] Wilcoxon F. Individual comparisons by ranking
methods. Biometrics, 1:80-83, 1945.

[3] Mann H.B. and Whitney D.R. On a test of whether
one of two random variables is stochastically larger
than the other. Ann. Math. Statist., 18:50-60, 1947.

[4] Hollander M. and Wolfe D.A. Nonparametric statistical
methods. Wiley publications, 1973.

Abstract

Nonparametric estimation of functionals is performed
in a biased sampling model. Several samples are independently
drawn from different populations, with constraints
imposed on the underlying distributions. A
functional of interest is expressed via a certain convex
combination of the underlying distributions.

A new estimation procedure is described here. It gives
an asymptotically efficient estimate of a quite arbitrary
functional. This procedure, properly modified, is
also relevant for estimating the whole distribution function
and various non-linear functionals. As an alternative to
the known estimation procedure studied by Gill, Vardi
and Wellner (1988), this technique seems to require fewer
computations.

1  Introduction 

Various studies conducted in technometrics and econometrics
are based on data that emerge from a population
with "slowly changing" features. Several mathematical
models have been proposed to describe a drifting population
via dynamic changes in the underlying distribution.
A biased sampling model can also be considered as one
of those. The model was initially considered by Vardi
(1985), who derived nonparametric maximum likelihood
estimates. Later Gill, Vardi, and Wellner (1988) proved
asymptotic optimality results for a nonparametric maximum
likelihood estimate of a cumulative distribution
function (CDF).

Under biased sampling, subsamples are drawn independently,
with constraints imposed on the underlying
distributions. This paper describes an alternative estimation
technique for a functional of interest,

$$\Psi(G) = \int K(x)\,dG(x)$$

represented as an integral with respect to a convex
combination, $G$, of the underlying distributions. The
method studied in this paper requires fewer computations
and attains the same asymptotic performance level as
the one derived in Gill et al. (1988). Applications
include a wide variety of more complicated transforms
and functionals of $G$. Once the estimate, $\hat G$, is derived
for the entire CDF, $G$, the plug-in rule suggests using
$\Psi(\hat G)$ as an asymptotically efficient estimate of
$\Psi(G)$, for a quite arbitrary transform, $\Psi(G)$. As an
illustration, a simple example emerging from econometric
studies is considered and numerical results are presented.
The algorithm and its modifications are presented.
Heuristic considerations are given to explain
why and how this technique works. The proofs
are quite long and technical and will be presented
in a paper in preparation (Koshevnik, 1994).

2  Construction  of  the  Estimate 

Suppose that $s$ samples (or strata),

$$X = \{X_{ij} : 1 \le i \le n_j;\ 1 \le j \le s\} \tag{1}$$

are collected independently, so that the $j$-th stratum
comes from a distribution $F_j$ $(1 \le j \le s)$. It is convenient
to use another probability mechanism to describe
the data generation as follows. Consider a bivariate random
variable $(J, X)$ where $J$ takes values $1, 2, \ldots, s$ with
certain probabilities, say $P(J = j) = \lambda_j$, which add up
to 1, i.e. $\sum_{j=1}^{s} \lambda_j = 1$. Each $F_j$ is nothing but the
conditional distribution of $X$, given $J = j$. A functional (or
coparameter) of interest introduced as a convex combination
of integrals,

$$\Psi = \sum_{j=1}^{s} \lambda_j \int K_j(x)\,dF_j(x) \tag{2}$$

is the expectation taken with respect to a joint distribution,
$P$, of the pair $(J, X)$,

$$\Psi = \int K(j,x)\,dP(j,x), \tag{3}$$

where $K(j,x) = K_j(x)$. Constraints on the conditionals
$\{F_j : 1 \le j \le s\}$ are expressed in terms of the
marginal distribution, $G$, of $X$, i.e.

$$G = \sum_{j=1}^{s} \lambda_j F_j. \tag{4}$$

Under the biased sampling model, these conditionals satisfy
an infinite system of equations

$$dF_j(x) = \omega_j\, w_j(x)\,dG(x), \qquad 1 \le j \le s, \tag{5}$$

and for this reason, they can all be expressed by
means of $G$. The weight functions, $\{w_j : 1 \le j \le s\}$,
are given, but the weight coefficients, or simply weights,
$(\omega_j : 1 \le j \le s)$, can be unknown. If there were no
relation between the conditionals, the natural estimate of
(3) that replaces $G$ by a similar convex combination of
empiricals $\{\hat F_j : 1 \le j \le s\}$, i.e.

$$\hat G = \sum_{j=1}^{s} \lambda_j \hat F_j \tag{6}$$

would have been impossible to improve. Additional information
provided by constraints enables one to adjust
the initial estimate by means of a device that performs
an orthogonal projection onto some subspace in
the space of estimates.

In practice, even the proportions $(\lambda_j : 1 \le j \le s)$ can
be unknown, but in this case, it is natural to assume
their empirical analogues to be consistent estimates, i.e.

$$\hat\lambda_j = n_j / N \to \lambda_j \tag{7}$$

for $1 \le j \le s$, where $N = n_1 + \cdots + n_s$. As far as the
weights $(\omega_j : 1 \le j \le s)$
are concerned, they also can be either known or unknown.
We consider both of the options. If the weights
are given, they can all be made equal to 1, since otherwise
each function $w_j$ will be replaced by $\omega_j w_j$. With
the weights unknown, since both $G$ and each $F_j$ are probability
measures, the following equations

$$\omega_j = \left(\int w_j(x)\,dG(x)\right)^{-1} \tag{8}$$

hold. This suggests an estimation procedure that
approximates the weights from the data.

For a coparameter of interest,

$$\Psi(G) = \int K(x)\,dG(x), \tag{9}$$

an alternative representation, via the joint distribution,
$P$, of a pair $(J, X)$ is exploited. Integration over $P$ is
nothing but a convex combination of integrals with respect
to the conditionals, as in (2), with $K_j$ expressed through
$K$ and the weight function $w_j$. This
suggests an estimate, $\hat\Psi = \hat\Psi_\beta$, of $\Psi(G)$ defined as

$$\hat\Psi_\beta = \sum_{j=1}^{s} \beta_j \int K_j(x)\,d\hat F_j(x) \tag{10}$$

with a vector of unknown coefficients

$$\beta = (\beta_j : 1 \le j \le s)$$

and empirical distributions $\{\hat F_j : 1 \le j \le s\}$ replacing
the conditionals. If both the weights and proportions are
specified, so that every $\omega_j$ can be made equal to 1, it will
be explained later that the coefficients $(\beta_j : 1 \le j \le s)$
can be selected equal to the corresponding $\lambda$'s.

Generally, however, an adaptive procedure is
required to construct the asymptotically efficient
estimate. Asymptotically, this estimator differs from the
one with the known weights. Asymptotic efficiency for
the estimate can be attained in this case by a two-step
procedure. At first, the weights are empirically approximated,
via their consistent (even $\sqrt{N}$-consistent)
empirical estimates, and then, with these substitutes,
the estimate is produced, as if the unknown weights were
equal to their empirical analogues. An explanation of
the algorithm is based on the minimization procedure.
Among all possible estimates of $\Psi(G)$ as in (9), pick
the one that minimizes the asymptotic variance in (10),

$$\mathrm{Var}\left(\hat\Psi_\beta\right).$$

This procedure will typically give a vector

$$\beta = (\beta_j = \beta_j(F_1, \ldots, F_s) : 1 \le j \le s),$$

whose components depend on $G$, or equivalently, on all
conditionals. Weak convergence results valid for the empirical
process defined by the estimate, $\hat G$, of the cumulative
distribution function, $G$, are shown in (Koshevnik,
1994) to hold uniformly in $P \in \mathcal{U}$. Here $\mathcal{U}$ is a suitably
chosen (small) neighborhood of $P$, so it becomes possible
to replace unknown weights by their $\sqrt{N}$-consistent
estimates. A similar idea was successfully implemented
in Koshevnik and Levit (1976), to construct an asymptotically
efficient estimator for a functional of interest
from independent identically distributed data.

3  Algorithm  Description 

Here we outline the algorithm showing how the asymptotically
efficient estimates can be computed. Three options
are considered: the first one deals with the case
when both weights and proportions are known; the second
one is designed for known weights and unknown proportions;
and the third one is aimed at estimating a functional
of interest when both weights and proportions are
unknown.


3.1  Known  Weights  and  Proportions 

In this case, assume that all $\omega_j = 1$. Since all $\lambda_j$ are
known, the equation

$$G = \sum_{j=1}^{s} \lambda_j F_j$$

contains uncertainty in the distributions only. Therefore,
the functional of interest (9) can be expressed as a convex
combination of integrals with respect to the distributions
$F_j$, which are constrained by (5). Putting a mass
$\lambda_j \left(n_j w_j(X_{ji})\right)^{-1}$ into each $X_{ji}$, and taking the empirical
average with respect to this weighted empirical distribution,
we obtain the desired estimate of the functional
(9). In particular, this method gives an asymptotically
efficient estimate of $G(t)$ for any fixed $t$.
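
The weighting step can be illustrated with a small Python sketch. It rests on assumptions that go beyond what is legible in the text: the weights are normalized so that all $\omega_j = 1$, the constraint is read as $dF_j = w_j\,dG$, and the mass attached to an observation $x$ of stratum $j$ is taken to be $\lambda_j / (n_j w_j(x))$. The function name `weighted_estimate` and its arguments are hypothetical.

```python
def weighted_estimate(strata, lambdas, weight_fns, kernel):
    """Estimate the integral of K with respect to G by averaging K over the
    weighted empirical distribution that puts mass lambda_j / (n_j * w_j(x))
    on each observation x of stratum j (assumes omega_j = 1 for all j)."""
    total = 0.0
    for xs, lam, w in zip(strata, lambdas, weight_fns):
        n_j = len(xs)
        total += lam * sum(kernel(x) / w(x) for x in xs) / n_j
    return total


# Two strata with constant weight functions: the estimate reduces to the
# lambda-weighted combination of the stratum means.
strata = [[1.0, 2.0, 3.0], [4.0, 6.0]]
lambdas = [0.6, 0.4]
unit = [lambda x: 1.0, lambda x: 1.0]
print(weighted_estimate(strata, lambdas, unit, lambda x: x))

# Estimate of G(t) at t = 2.5: use the indicator of x <= t as the kernel.
print(weighted_estimate(strata, lambdas, unit, lambda x: float(x <= 2.5)))
```

Choosing the kernel as an indicator function gives the estimate of $G(t)$ mentioned at the end of the paragraph; other kernels give other linear functionals.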


3.2 Known Weights and Unknown Proportions

Once we agree that sample proportions provide consistent
estimates of the unknown $\lambda_j$'s, i.e. (7) holds,
a similar estimate can be proposed for the functional
(9) in this case. Replacing the unknown proportions
by their sample analogues and putting the mass
$\hat\lambda_j \left(n_j w_j(X_{ji})\right)^{-1}$
into $X_{ji}$, we obtain an estimate of $G$. Averaging with
respect to this weighted empirical distribution will give
an asymptotically efficient estimate of (9), as before.

3.3  Unknown  Weights  and  Proportions 

In this case we can no longer ignore that the weights
are unknown. As far as the proportions are concerned,
their sample versions are still assumed to be consistent
estimates. The proposed estimate of (9) is as in (10),
with the coefficients $(\beta_j : 1 \le j \le s)$ obtained from
the minimization procedure followed by the plug-in rule.
Equivalently, the first step of this algorithm defines $\beta$ as
a vector of functionals, with components depending on
the conditionals, while the second step replaces these
distributions by their empirical versions, each of them
based on the corresponding stratum.


4 Asymptotic Efficiency: Outline


The constraints (5) imposed on the distributions
can be exploited to derive information inequalities
similar to those presented in Bickel et al. (1993),
see also Koshevnik and Levit (1976). To understand
why the proposed methods provide asymptotically efficient
estimates, consider the case $s = 2$ here. Let
$F_1$ and $F_2$ be mutually absolutely continuous. Again,
$G = \lambda_1 F_1 + \lambda_2 F_2$ is a convex combination of the underlying
conditional distributions here. Assume that $\lambda_1$
and $\lambda_2$ are given and the weights are known. Then, the
functional (9) can be written as

$$\Psi(G) = \frac{\lambda_1}{\omega_1} \int \frac{K}{w_1}\,dF_1 + \frac{\lambda_2}{\omega_2} \int \frac{K}{w_2}\,dF_2,$$

so with the weights known, it is easy to see that
the adjustment proposed for this case generally will give
just the weighted empirical distribution. With the proportions
$\lambda_j$'s unknown, a similar procedure will be relevant
due to a general result in (Koshevnik, 1994). This
result implies in particular that the weighted empirical
processes

$$\sqrt{N} \sum_{j=1}^{s} \beta_j \left(\hat F_j - F_j\right)$$

converge weakly to certain Gaussian processes, not
only for every vector $\beta = (\beta_j : 1 \le j \le s)$ satisfying
$\sum_j \beta_j = 1$, but uniformly in $\beta$. Hence, convergence (7)
implies that replacing $\lambda_j$ by its consistent estimate
$\hat\lambda_j$ will not cause any change in the asymptotic behavior of
the estimate.

A slightly more complicated argument can be exploited
to prove that in the case of unknown weights
$\omega_j$'s, the same general result from (Koshevnik, 1994)
implies uniform (in $\beta \in B$, where $B$ is a set of vectors
$\beta$) weak convergence of the weighted empirical distributions.
Hence, having replaced the unknown weights by
their $\sqrt{N}$-consistent estimates, we will change only the
variance-covariance structure of the limiting Gaussian
process,

$$\sqrt{N} \left(\hat G(t) - G(t)\right)$$

Fortunately, this will be just the same limiting process
that appears in Gill, Vardi, and Wellner (1988) to describe
both the limit in distribution for the proposed
estimate and the lower bounds of asymptotic risks. The
reasons for this coincidence are similar to those noticed
by Koshevnik and Levit (1976) for the case of homogeneous
observations. (See also Koshevnik and Levit
(1983) where a more general result is presented.) This
fact enables one to almost "automatically" prove asymptotic
efficiency of the proposed estimates.

5  Example 

The  application  described  here  illustrates  how  the 
proposed  technique  works  in  a  relatively  simple  case,  for 
a  combination  of  continuous  and  discrete  data.  Mean¬ 
while,  this  example  also  explains  why  and  how  the  pro¬ 
posed  method  is  relevant  for  partially  observable  data, 
which  are  subject  to  either  censoring  or  truncation. 

A data set includes three strata. The first stratum,
labelled as $j = 0$, is simply a collection of independent
observations $(X_{0,i} : 1 \le i \le n_0)$, with a common cumulative
distribution, $F$. For two given values, $d_1$ and $d_2$,
the second and third strata contain values

$$(Y_{ji} = I(X_{ji} < d_j) : 1 \le i \le n_j),$$

labelled  as  j  =  1  and  j  =  2  respectively.  These  data 
came  out  from  econometric  studies  where  several  polls 
were  processed.  The  stratum  with  j  =  0  contains  com¬ 
plete  answers  on  how  much  respondents  had  actually 
paid  for  the  services  provided,  while  the  two  remaining 
strata  came  from  the  polls  aimed  to  indicate  whether 
respondents  would  have  agreed  to  pay  a  certain  amount 
of money for the same services. Records included indicators
of events such as $X < d_1$ (the second stratum) or
$X < d_2$ (the third stratum). These data are obviously
incomplete. The parameter to be estimated includes two
components of interest, namely

$$(F(d_1), F(d_2)). \qquad (11)$$

An  asymptotically  efficient  estimate  of  (11)  can  be  found 
from  the  competing  estimates  based  on  different  strata. 
Notice  that  each  of  the  cell  probabilities  in  (11)  can  be 
estimated  by  means  of  either  the  empirical  CDF  based 
on the complete ($j = 0$) stratum or on the corresponding
stratum, $j = 1$ or $j = 2$ respectively, of incomplete
dichotomous data. Neither of these estimates is efficient
in general, but both of them are unbiased. The general
method suggests using an initial estimate such as $\hat{F}(d_1)$
and $\hat{F}(d_2)$, i.e. the empirical CDF based on
stratum 0. As a competitor, the proportions $p_j = n_j^{-1} \sum_i Y_{ji}$ can
be calculated via the indicators $Y_{1i}$ and $Y_{2i}$, i.e. simply from the sums over
strata 1 and 2. Theoretically, these estimates
have equal expectations, and making them equal
amounts to finding an adjusted (or balanced) estimate of
(11). The following algorithm was developed to estimate
(11) in Koshevnik and Schucany (1994).

1.  Calculate  empirical  proportions  as  described. 


2. The estimate, $\tilde{F}(d_1)$, of the first component of (11) is proposed as

$$\tilde{F}(d_1) = \hat{F}(d_1) - a_{11}\left(\hat{F}(d_1) - p_1\right) - a_{12}\left(\hat{F}(d_2) - p_2\right),$$

and a similar expression can be written for the
second component, with the coefficients $a_{21}$ and $a_{22}$
replacing $a_{11}$ and $a_{12}$, respectively.

3. For each of the components, the variance minimization
problem is solved with respect to the coefficients
$a_j = (a_{j1}, a_{j2})$. This gives the coefficients
depending on the distribution $F$.

4. The solution $a_{ji}(F)$ is replaced by its empirical estimate,
using the empirical function, $\hat{F}$, which will
be $\sqrt{N}$-consistent, provided all three values, i.e.
$n_0/N$, $n_1/N$ and $n_2/N$,
converge to strictly positive limits, as
$N = n_0 + n_1 + n_2 \to \infty$.

5.  This  will  yield  the  asymptotically  efficient  estimate 
of  (11). 

Asymptotic  efficiency  in  this  case  is  implied  by  the 
possibility  to  reduce  the  problem  into  a  parametric  one. 
A similar adjustment procedure has also been
developed for any functional such as $F(t)$, with the vector
of coefficients $a(t) = (a_1(t), a_2(t))$. Asymptotic
efficiency of this general estimate requires some facts
regarding empirical processes.

5.1  Numerical  Illustration 

The sizes coincide for all three strata, i.e.

$$n_0 = n_1 = n_2 = 100,$$

while $\hat{F}(d_1) = 0.3$ and $\hat{F}(d_2) = 0.6$. The calculations
performed for the incomplete strata gave
$p_1 = 0.35$ and $p_2 = 0.70$.

The coefficients calculated by minimizing
the variance of the proposed estimate $\tilde{F}(d_1)$ (respectively,
$\tilde{F}(d_2)$) were

$$a_{11} = 0.481 \quad\text{and}\quad a_{12} = 0.067$$

for the first component, and

$$a_{21} = 0.154 \quad\text{and}\quad a_{22} = 0.462$$

for the second one. With these values of the coefficients,
the improved estimate $\tilde{F}(d_1)$ of the first component of (11) is equal to

$$0.30 - 0.481 \cdot (0.30 - 0.35) - 0.067 \cdot (0.60 - 0.70) = 0.331,$$

and similarly, $\tilde{F}(d_2)$ is calculated as

$$0.60 - 0.154 \cdot (0.30 - 0.35) - 0.462 \cdot (0.60 - 0.70) = 0.654.$$
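
The arithmetic of this illustration can be checked with a minimal Python sketch. The function name `adjusted_estimate` and its argument layout are illustrative assumptions, not part of the original paper; the coefficients are simply the values quoted in the text and are not recomputed here.

```python
def adjusted_estimate(f_hat, p, coeffs, component):
    """Adjusted (balanced) estimate of one component of (F(d1), F(d2)).

    f_hat     -- (F_hat(d1), F_hat(d2)), empirical CDF values from stratum 0
    p         -- (p1, p2), proportions from the two incomplete strata
    coeffs    -- (a_j1, a_j2), adjustment coefficients for this component
    component -- 0 for F(d1), 1 for F(d2)
    """
    diff_1 = f_hat[0] - p[0]  # disagreement between the two estimates of F(d1)
    diff_2 = f_hat[1] - p[1]  # disagreement between the two estimates of F(d2)
    return f_hat[component] - coeffs[0] * diff_1 - coeffs[1] * diff_2


f_hat = (0.30, 0.60)
p = (0.35, 0.70)
first = adjusted_estimate(f_hat, p, (0.481, 0.067), component=0)
second = adjusted_estimate(f_hat, p, (0.154, 0.462), component=1)
print(round(first, 3), round(second, 3))  # 0.331 0.654
```

Both adjusted values move from the stratum-0 estimates towards the proportions observed in the incomplete strata, as expected when the coefficients are positive.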
6  Conclusions

The method presented here provides the same asymptotic
optimality of estimation as the initial proposal by
Vardi (1985). Computationally, it turns out to be easier,
for it does not require some auxiliary components to be
estimated as precisely as possible. The asymptotic performance
attainable by means of this method can hardly
guarantee that it works perfectly for any reasonably large
data set. Some additional studies are needed to investigate
its features for moderately large samples.

Possible applications include various models with partially
observable data, such as censored or truncated
observations. Being focused on asymptotic optimality
only, this method provides a unified approach to
estimation of various functionals and transforms of the
unknown infinite-dimensional parameter. Both asymptotic
normality and asymptotic optimality of the proposed
estimates are due to the same result, known as
the empirical central limit theorem and derived uniformly,
as the infinite-dimensional parameter, $G$, in the considered
model, runs over a small neighborhood in the set
of distributions. The only serious technical limitation of
the applicability of the results presented in Koshevnik
(1994) is that this neighborhood must be precompact
with respect to the Kolmogorov distance between two
distributions, i.e.

$$d(G_1, G_2) = \sup_{-\infty < t < \infty} |G_1(t) - G_2(t)|.$$
Abstract

Three  different  approaches  to  approximation  of  double 
bootstrap  confidence  intervals,  each  with  the  aim  of  im¬ 
proving  computational  efficiency,  are  considered.  The 
first  replaces  the  need  for  a  second  level  of  bootstrap 
sampling  by  analytic  tail  area  approximations.  The  sec¬ 
ond  performs  the  second  level  of  sampling  in  a  sequential 
manner.  The  third  uses  empirical  versions  of  asymp¬ 
totic  expansions  for  the  end  points  of  the  double  boot¬ 
strap  confidence  interval  and  for  the  additive  correction 
to  nominal  coverage  to  avoid  the  need  for  resampling. 
The  three  methods  are  compared  in  relation  to  their  re¬ 
spective  set-up  costs,  the  improvements  in  efficiency  they 
yield,  the  coverage  properties  of  the  approximate  inter¬ 
vals  and  the  generality  with  which  they  may  be  applied. 

1  Introduction 

The  iterated  bootstrap  provides  a  satisfactory  theoreti¬ 
cal  solution  to  the  problem  of  producing  non-parametric 
confidence  intervals  with  high  coverage  accuracy,  as  well 
as  stable  lengths  and  endpoints:  see  [4]  (Section  3.11). 
An  iterated  bootstrap  confidence  interval  requires  an  ad¬ 
ditive  correction  to  be  made  to  the  nominal  coverage  of 
an  uncorrected  interval.  This  correction  will  usually  be 
made  using  a  double  bootstrap  resampling  procedure  in¬ 
volving  two  nested  levels  of  Monte  Carlo  simulation,  and 
is  therefore  often  computationally  prohibitively  expen¬ 
sive  for  routine  use. 

Recently  there  has  been  much  attention  paid  to  pro¬ 
cedures  by  which  the  computational  demands  of  the  it¬ 
erated  bootstrap  confidence  interval  construction  may 
be  reduced.  In  this  paper  we  consider  three  different 
approaches  to  approximation  of  iterated  bootstrap  con¬ 
fidence  intervals.  The  first  ([2],  [3])  replaces  the  need  for 
a  second  level  of  bootstrap  sampling  by  use  of  analytic 
tail  area  approximations  based  on  saddlepoint  methods. 
The  second  ([5])  performs  the  second  level  of  sampling 
in  a  sequential  manner.  The  third  ([6])  uses  empirical
versions  of  asymptotic  expansions  for  the  additive  cor¬ 
rection  to  nominal  coverage  and  for  the  end  points  of 
the  iterated  bootstrap  intervals  to  provide  two  computa¬ 
tionally  attractive  methods  of  approximation.  The  first 
asymptotic  interval  replaces  the  need  for  a  second  level 
of  bootstrap  sampling  by  a  series  of  simple  numerical 
computations  which  are  readily  automated.  The  second 
interval  requires  no  resampling.  The  three  approaches 
are  compared  in  relation  to  their  respective  set-up  re¬ 
quirements,  the  improvements  in  efficiency  they  yield, 
the  coverage  properties  of  the  approximate  intervals  and 
the  generality  with  which  they  may  be  applied. 

Section  2  provides  some  background  and  a  formal  defi¬ 
nition  of  the  iterated  bootstrap  confidence  interval.  Sec¬ 
tion  3  discusses  the  analytic  approximation  approach. 
Section  4  presents  a  discussion  of  the  sequential  sampling 
idea  and  Section  5  discusses  the  asymptotic  calibration 
approach.  A  simulation  study  involving  construction  of 
bootstrap  confidence  intervals  for  the  population  vari¬ 
ance,  together  with  general  discussion,  is  presented  in 
Section  6. 

2  Iterated  Bootstrap  Confidence 
Interval 

We will consider the following problem. We wish to construct
an accurate bootstrap confidence interval for a
scalar parameter $\theta$ expressible as a smooth function of a
vector mean: $\theta = g(\mu)$, where

$$\mu = (\mu_1, \ldots, \mu_d) = (E\{f_1(W)\}, \ldots, E\{f_d(W)\}),$$

with smooth, real-valued functions $f_1, \ldots, f_d$ and $W$ denoting
a generic random variable with the underlying
$k$-dimensional distribution $F$. The form of $F$ is unspecified,
but our data $\mathcal{X} = (W_1, \ldots, W_n)$ consist of $n$ observations
independently drawn from $F$. Suppose further
that $\theta$ is estimated by $\hat{\theta} = g(\bar{X})$, where

$$\bar{X} = n^{-1} \sum_{i=1}^{n} X_i,$$

with

$$X_i = (X_{i1}, \ldots, X_{id}) = \{f_1(W_i), \ldots, f_d(W_i)\}, \quad i = 1, \ldots, n.$$

We shall see later that assumption of such a ‘smooth
function model’ is crucial to the analytic approximation
and asymptotic calibration methods, but not to the approach
based on sequential sampling.

Let $\mathcal{X}^*$ denote a generic resample - or ‘bootstrap
sample’ - of size $n$ drawn from $\mathcal{X}$, obtained by independently
sampling with replacement from $\mathcal{X}$. Denote
by $I_0(\alpha; \mathcal{X}, \mathcal{X}^*)$ a bootstrap confidence interval for $\theta$ of
nominal coverage $\alpha$. This interval $I_0$ could, for example,
be the percentile method confidence interval, defined below.

The coverage probability of $I_0$ is

$$\pi(\alpha) = P\{\theta \in I_0(\alpha; \mathcal{X}, \mathcal{X}^*) \mid F\},$$

and in many cases will be significantly different from $\alpha$.

The interval $I_0(\alpha + t; \mathcal{X}, \mathcal{X}^*)$, where $\pi(\alpha + t) = \alpha$,
has coverage exactly equal to the nominal coverage $\alpha$.
Of course, the value of the ‘calibration coefficient’ $t$ is
rarely available. The idea behind the iterated bootstrap
in this context is that a bootstrap estimator of $t$ may be
constructed using a second level of resampling.

Let $\mathcal{X}^{**}$ denote a generic resample from $\mathcal{X}^*$ and let
$I_0(\alpha; \mathcal{X}^*, \mathcal{X}^{**})$ be the version of $I_0(\alpha; \mathcal{X}, \mathcal{X}^*)$ computed
using $\mathcal{X}^*$ and $\mathcal{X}^{**}$ instead of $\mathcal{X}$ and $\mathcal{X}^*$, respectively.
Then the bootstrap estimate of $\pi(\alpha)$ is

$$\hat{\pi}(\alpha) = P\{\hat{\theta} \in I_0(\alpha; \mathcal{X}^*, \mathcal{X}^{**}) \mid \mathcal{X}\},$$

with the calibration coefficient $t$ being estimated by $\hat{t}$,
where

$$\hat{\pi}(\alpha + \hat{t}) = \alpha.$$

The iterated bootstrap confidence interval for $\theta$ is then
$I_1(\alpha; \mathcal{X}, \mathcal{X}^*) = I_0(\alpha + \hat{t}; \mathcal{X}, \mathcal{X}^*)$.

In practice, the iterated bootstrap confidence interval
construction requires Monte Carlo simulation. A finite
number $B$ of bootstrap samples, $\mathcal{X}_1^*, \ldots, \mathcal{X}_B^*$, are drawn
from $\mathcal{X}$ at an outer level of resampling, and $\hat{\pi}(\alpha)$ estimated
by the proportion

$$\mathrm{card}\{1 \le b \le B : \hat{\theta} \in I_0(\alpha; \mathcal{X}_b^*, \mathcal{X}_b^{**})\}/B.$$

Usually, exact evaluation of $I_0$ is not feasible, so a second
level of $C$ resamples is drawn from $\mathcal{X}_b^*$ to approximate
$I_0(\alpha; \mathcal{X}_b^*, \mathcal{X}_b^{**})$.

We see, therefore, that to approximate $\hat{t}$, $B$ resamples
must be drawn at an outer level of resampling, and $C$
resamples drawn, for each outer level resample, at the inner
level. So a total of $B(C + 1)$ bootstrap samples must
be drawn to construct the iterated bootstrap confidence
interval $I_1$. Further, both $B$ and $C$ must be large, of the
order of 1000s, in order to reduce Monte Carlo simulation
error to acceptable proportions and ensure accurate
approximation to the theoretical interval. Some means
of improving computational efficiency is desirable.
Typically the confidence interval $I_0$ will be taken as
the percentile-method interval. It is noted (see for example
[4], Section 3.11.1) that the percentile method yields
confidence intervals with stable lengths and endpoints:
bootstrap iteration offers the prospect of retaining desirable
stability while enhancing coverage accuracy. The
percentile method is based on the premise that the sampling
distribution of $\hat{\theta}^* = g(\bar{X}^*)$ under sampling from
$\mathcal{X}$ should be close to the unconditional distribution of $\hat{\theta}$
under sampling from $F$.

Define $y_\beta$ by $P(\hat{\theta} \le y_\beta \mid F) = \beta$. The bootstrap
estimate is $\hat{y}_\beta$, where $P(\hat{\theta}^* \le \hat{y}_\beta \mid \mathcal{X}) = \beta$. The (theoretical)
nominal $\alpha$-level percentile confidence interval for
$\theta$ is $I_0 = [\hat{y}_{1-\xi}, \hat{y}_\xi]$, where $\xi = (1 + \alpha)/2$.

For the case of the percentile method interval $I_0$, the
approximation to $\hat{\pi}(\alpha)$ becomes

$$\mathrm{card}\{1 \le b \le B : 1 - \xi \le P(\hat{\theta}_b^{**} \le \hat{\theta} \mid \mathcal{X}_b^*) \le \xi\}/B,$$

where $\hat{\theta}_b^{**} = g(\bar{X}_b^{**})$ and $\mathcal{X}_b^{**}$ denotes, as before, a generic
bootstrap sample drawn from the outer level bootstrap
sample $\mathcal{X}_b^*$.
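
The nested resampling scheme of this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the grid of candidate levels, the index rule used for percentiles, and the choice of the biased sample variance as the statistic are all assumptions made for the example, and the values of B and C are far smaller than the thousands recommended in the text.

```python
import random


def biased_variance(xs):
    """Sample variance with divisor n, used here as the statistic theta-hat."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def resample(xs, rng):
    """Draw len(xs) values from xs independently with replacement."""
    return [rng.choice(xs) for _ in xs]


def tail_probabilities(data, stat, B, C, rng):
    """Return the B outer statistics and, for each outer resample, the share of
    C inner statistics not exceeding theta-hat, i.e. a Monte Carlo estimate of
    P(theta** <= theta-hat | outer resample)."""
    theta_hat = stat(data)
    outer_stats, tails = [], []
    for _ in range(B):
        outer = resample(data, rng)
        outer_stats.append(stat(outer))
        hits = sum(stat(resample(outer, rng)) <= theta_hat for _ in range(C))
        tails.append(hits / C)
    return outer_stats, tails


def coverage_estimate(tails, level):
    """Share of outer resamples whose tail probability lies in
    [(1 - level)/2, (1 + level)/2]: the estimated coverage at that level."""
    lo, hi = (1 - level) / 2, (1 + level) / 2
    return sum(lo <= p <= hi for p in tails) / len(tails)


def percentile_interval(outer_stats, level):
    """Percentile-method interval read off the sorted outer statistics."""
    s = sorted(outer_stats)
    last = len(s) - 1
    return s[round((1 - level) / 2 * last)], s[round((1 + level) / 2 * last)]


def iterated_interval(data, stat, alpha, B=200, C=100, seed=0):
    """Calibrated level (alpha + t-hat) and the corresponding percentile interval."""
    rng = random.Random(seed)
    outer_stats, tails = tail_probabilities(data, stat, B, C, rng)
    grid = [0.5 + 0.005 * k for k in range(100)]  # candidate levels 0.500 .. 0.995
    # Smallest candidate level whose estimated coverage reaches alpha.
    level = next((g for g in grid if coverage_estimate(tails, g) >= alpha), grid[-1])
    return level, percentile_interval(outer_stats, level)
```

A call such as `iterated_interval(data, biased_variance, 0.90)` draws B(C + 1) resamples in total, which is the cost that the approximations discussed in the following sections aim to reduce.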

3  Analytic  Approximation 

DiCiccio,  Martin  and  Young  ([2],  [3])  consider  analyt¬ 
ical  methods  which  significantly  reduce  the  computa¬ 
tional  demands  of  the  iterated  bootstrap.  Their  methods 
employ  saddlepoint  approximations  to  replace  the  inner 
level of resampling.

Define $\hat{\theta}^* = g(\bar{X}^*)$ and $\hat{\theta}^{**} = g(\bar{X}^{**})$, where $\bar{X}^*$ and
$\bar{X}^{**}$ are versions of $\bar{X}$ computed using $\mathcal{X}^*$ and $\mathcal{X}^{**}$, respectively,
in place of $\mathcal{X}$.

The procedure described by DiCiccio, Martin and
Young ([2]) is based on estimation of the tail probability
$P(\hat{\theta}^{**} \le \hat{\theta} \mid \mathcal{X}^*)$ through saddlepoint approximation
to the joint density of the components $\bar{X}_1^{**}, \ldots, \bar{X}_d^{**}$ of
$\bar{X}^{**}$ given $\mathcal{X}^*$, together with application of a tail probability
approximation of DiCiccio and Martin ([1]) to the
saddlepoint density.

The algorithm used by DiCiccio, Martin and Young
([2]) for construction of an approximate iterated bootstrap
confidence interval involves first drawing $B$ resamples
from $\mathcal{X}$. For each resample $\mathcal{X}_b^*$
($b = 1, \ldots, B$) the analytic approximation is used to
estimate $P(\hat{\theta}_b^{**} \le \hat{\theta} \mid \mathcal{X}_b^*)$. DiCiccio, Martin and Young
([2]) suggest choosing several nominal levels $\gamma_1, \gamma_2, \ldots$
close to the desired level $\alpha$ and determining whether the
condition

$$\tfrac{1}{2}(1 - \gamma_i) \le P(\hat{\theta}_b^{**} \le \hat{\theta} \mid \mathcal{X}_b^*) \le \tfrac{1}{2}(1 + \gamma_i)$$

is satisfied for each $\gamma_i$. Then an estimate of $\pi(\gamma_i)$ is the
proportion among the $B$ resamples for which the condition
holds for the respective $\gamma_i$. The desired calibration
coefficient $\hat{t}$, which has $\hat{\pi}(\alpha + \hat{t}) = \alpha$, is approximated
by interpolation between the $\{\gamma_i, \hat{\pi}(\gamma_i)\}$ pairs. The approximate
iterated confidence interval is the percentile
method interval of nominal level $\alpha + \hat{t}$ based on the resamples
$\mathcal{X}_1^*, \ldots, \mathcal{X}_B^*$.
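
The interpolation step can be illustrated with a short Python sketch. Linear interpolation is an assumption made here for simplicity (the text does not specify the interpolation rule), the function name `calibrated_level` is hypothetical, and the coverage figures in the usage line are made-up numbers chosen only to exercise the code.

```python
def calibrated_level(levels, coverages, alpha):
    """Linearly interpolate between (level, estimated coverage) pairs to find
    the nominal level whose estimated coverage equals alpha.

    The returned value plays the role of alpha + t-hat; subtracting alpha
    gives the estimated calibration coefficient.
    """
    pairs = sorted(zip(levels, coverages))
    for (g0, c0), (g1, c1) in zip(pairs, pairs[1:]):
        # Use the first adjacent pair whose coverages bracket alpha.
        if c0 <= alpha <= c1 and c1 > c0:
            return g0 + (alpha - c0) * (g1 - g0) / (c1 - c0)
    raise ValueError("alpha is not bracketed by the estimated coverages")


# Hypothetical coverage estimates at three nominal levels near alpha = 0.90.
level = calibrated_level([0.90, 0.94, 0.98], [0.84, 0.89, 0.95], alpha=0.90)
t_hat = level - 0.90
```

If the chosen levels do not bracket the target coverage, the sketch raises an error rather than extrapolating; in that situation further nominal levels would have to be examined.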

The key computational requirement of the procedure
of DiCiccio, Martin and Young ([2]) is iterative solution
of a system of $2d + 1$ non-linear equations in as
many unknowns, together with a series of matrix inversions.
In practice, for some first-level bootstrap samples
the iteration may fail to converge. When this occurs
we recommend use of the resampling approach instead.
Computational efficiency is determined largely by the
frequency with which the iteration fails to converge.
DiCiccio, Martin and Young ([2]) give a number of examples
of use of their procedure, which demonstrate the
value of the approach, both in terms of accuracy and
computational efficiency. We restrict attention here to a
series of general remarks on this approach.

(1)  It  is  observed  that  the  analytic  approximation  ap¬ 
proach  yields  confidence  intervals  with  little  dis¬ 
cernible  loss  of  coverage  accuracy  over  the  full¬ 
blown  iterated  resampling  intervals  constructed  us¬ 
ing  nested  levels  of  resampling. 

(2) The advantage of using the methods - which may
be a tenfold reduction or more in computation for
simple problems - diminishes as the dimensionality
$d$ increases, for then the complexity of the iterative
procedure increases.

(3)  The  methods  entail  some  setup  costs,  in  terms  of  re¬ 
coding  for  different  problems,  and,  as  already  noted, 
require  use  of  fairly  sophisticated  packaged  numeri¬ 
cal  routines  for  root  finding  etc. 

(4) DiCiccio, Martin and Young ([3]) demonstrate how
the analytic methods may be modified to make construction
of iterated bootstrap confidence intervals
by this approach both feasible and computationally
worthwhile in more complicated situations. They
approximate the solution of the system of non-linear
equations, and so avoid the costly iteration.
Use of the resampling alternative to the analytic
approach is then never required. The crude methods
DiCiccio, Martin and Young ([3]) describe incur
some loss of coverage accuracy over the previous analytic
approach, but computational savings are substantial.
While reliance on sophisticated numerical
routines is reduced, setup costs are still substantial.

(5)  A  weakness  of  the  approach  lies  in  the  fact  that  the 
analytic  methods  are  restricted  in  use  to  the  partic¬ 
ular  smooth  function  model  described  in  Section  2 
above. 

(6)  For  a  given  problem,  the  computational  advantage 
of  using  the  analytic  approximations  of  DiCiccio, 
Martin  and  Young  ([2])  is  most  substantial  for  larger 
sample  sizes  n,  for  then  the  saddlepoint  equations 
are  generally  easier  to  solve.  However,  the  bootstrap 
is  most  likely  to  be  indicated  for  use  with  smaller 
sample  sizes. 

(7) Computational speed of the analytic methods of
DiCiccio, Martin and Young ([2]) is observed also
to depend heavily on the underlying distribution,
as the iteration converges in many fewer steps for
some data samples than others. Use of the alternative
analytic procedure of DiCiccio, Martin and
Young ([3]) effectively eliminates the dependence of
computational efficiency on $n$ and $F$.

4  Sequential  Sampling 

Recall the algorithm for construction of the iterated
bootstrap confidence interval. For each $\gamma_i$, $i = 1, \ldots, I$
($I = 3$ is sufficient in practice) we wish to know whether
the condition

$$\tfrac{1}{2}(1 - \gamma_i) \le P(\hat{\theta}_b^{**} \le \hat{\theta} \mid \mathcal{X}_b^*) \le \tfrac{1}{2}(1 + \gamma_i)$$

is satisfied, for each of $B$ bootstrap samples $\mathcal{X}_1^*, \ldots, \mathcal{X}_B^*$
drawn from $\mathcal{X}$. The value of $p = P(\hat{\theta}_b^{**} \le \hat{\theta} \mid \mathcal{X}_b^*)$ is not
actually required, but instead we wish to know whether
the condition

$$\tfrac{1}{2}(1 - \gamma_i) \le p \le \tfrac{1}{2}(1 + \gamma_i)$$

is satisfied, $i = 1, \ldots, I$.

Assume $0 < \gamma_1 < \gamma_2 < \cdots < \gamma_I$. Then we wish to test
simultaneously a set of nested hypotheses $H_1, \ldots, H_I$,
where $H_i$ is the hypothesis that $\tfrac{1}{2}(1 - \gamma_i) \le p \le \tfrac{1}{2}(1 + \gamma_i)$.

Lee and Young ([5]) demonstrate how to construct
a simultaneous sequential probability ratio test of the $I$
hypotheses.  The  idea  now,  therefore,  is  to  use  a  differ¬ 
ent  number  of  second  level  resamples  for  each  first  level 
resample,  with  the  stopping  rules  of  the  sequential  prob¬ 
ability  ratio  test  designed  to  minimise  the  (asymptotic) 
expected  number  of  second  level  resamples  drawn.  De¬ 
tails  of  the  computation  of  the  stopping  rules  are  given 
by  Lee  and  Young  ([5]).  Since  the  sequential  probability 
ratio  test  is  being  proposed  as  an  alternative  to  the  use 
of  inner  level  bootstrap  sampling  with  a  fixed  number 
C  of  resamples,  the  approach  is  to  constrain  the  error 
in  testing  Hj  by  the  sequential  approach  to  be  the  same 
as  that  incurred  when  testing  Hj  by  a  fixed-sample  test 
with  sample  size  C.  Computation  of  the  stopping  rules 
then  amounts  to  solving  a  straightforward  constrained 
optimization  problem.  The  fixed  sample  size  C  is  used 
as  a  terminating  upper  bound  on  the  sequential  stop¬ 
ping  time  of  the  simultaneous  sequential  probability  ra¬ 
tio  test,  so  that  we  are  guaranteed  to  draw  fewer  second 
level  bootstrap  samples  than  in  the  standard  construc¬ 
tion  of  the  iterated  bootstrap  interval. 
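
To illustrate the idea of stopping the inner level of resampling early, the following Python sketch implements Wald's classical sequential probability ratio test for a single Bernoulli probability, truncated at a maximum number of draws. It is a simplified stand-in and not the simultaneous test of Lee and Young ([5]), whose stopping rules are not reproduced here; the function name, the simple hypotheses p0 and p1, and the error rates are assumptions made for the example. In the bootstrap setting each draw would be the indicator that an inner-level statistic does not exceed the original estimate.

```python
import math


def truncated_sprt(draws, p0, p1, alpha, beta, max_draws):
    """Wald SPRT of H0: p = p0 against H1: p = p1 (with p0 < p1) for 0/1 draws.

    alpha and beta are the nominal type I and type II error rates.
    Returns (decision, number of draws used); the decision is "undecided"
    if neither boundary is crossed within max_draws draws.
    """
    upper = math.log((1 - beta) / alpha)  # crossing above accepts H1
    lower = math.log(beta / (1 - alpha))  # crossing below accepts H0
    log_ratio = 0.0
    used = 0
    for used, x in enumerate(draws, start=1):
        if x:
            log_ratio += math.log(p1 / p0)
        else:
            log_ratio += math.log((1 - p1) / (1 - p0))
        if log_ratio >= upper:
            return "H1", used
        if log_ratio <= lower:
            return "H0", used
        if used >= max_draws:
            break
    return "undecided", used


# A stream of successes is classified after only a handful of draws.
decision, used = truncated_sprt(iter([1] * 1000), 0.5, 0.9, 0.05, 0.05, 1000)
print(decision, used)  # H1 6
```

The cap `max_draws` mirrors the role of C in the text: the number of inner-level draws never exceeds the fixed-sample size, and for clear-cut cases it is far smaller.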

By use of the sequential sampling idea we may construct
an approximation to the iterated bootstrap confidence
interval with considerable computational savings
over the standard procedure which draws a fixed number
$C$ of second level resamples from each first level resample.
By construction, the sequential sampling procedure
has (asymptotically) the same error in estimation
of $\pi(\gamma_i)$ as the standard procedure. Key remarks on the
sequential sampling approach are the following:

(1)  The  resulting  intervals  display  no  significant  loss 
of  accuracy  over  the  full-blown  iterated  resampling 
intervals,  by  design. 

(2)  In  typical  problems,  the  sequential  intervals  use 
only  about  10-20%  of  the  computational  effort  re¬ 
quired  by  the  direct  approach.  Computational  sav¬ 
ings  are  therefore  competitive  with  those  achieved 
by  analytic  methods  in  moderately  complex  prob¬ 
lems,  though  less  in  simple  problems. 

(3) Computational gains through use of the sequential
sampling idea are roughly problem independent and
also roughly independent of the sample size $n$ or underlying
distribution $F$, and indeed of the parameter
$\theta$ being studied. The approach may therefore
be used in any new problem of interest, secure in
the knowledge that it will yield a definite level of
computational saving.

(4) The sequential approach can be used for any parameter
$\theta$, not just for the smooth function model
described in Section 2.

(5) Setup of the approach is performed just once. The
method may then be applied without modification
to construct a confidence interval for any parameter
$\theta$. No sophisticated numerical procedures are
required for implementation.

5  Asymptotic  Calibration 

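For the percentile method confidence interval $I_0$, the calibration
coefficient $t$ satisfies

$$P(\theta \in [\hat{y}_{1-\xi-t/2}, \hat{y}_{\xi+t/2}] \mid F) = \alpha.$$

As noted, $t$ depends on $F$, which is unspecified, so is
unavailable.

The bootstrap version of $t$ is $\hat{t}$, which satisfies

$$P(\hat{\theta} \in [\hat{y}^*_{1-\xi-\hat{t}/2}, \hat{y}^*_{\xi+\hat{t}/2}] \mid \mathcal{X}) = \alpha,$$

where

$$P(\hat{\theta}^{**} \le \hat{y}^*_\beta \mid \mathcal{X}^*, \mathcal{X}) = \beta.$$

Given $\hat{t}$, the two-sided iterated bootstrap confidence interval
of nominal coverage $\alpha$ is

$$I_1 = [\hat{y}_{1-\xi-\hat{t}/2}, \hat{y}_{\xi+\hat{t}/2}].$$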

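Using Hall ([4]), we may establish, under mild conditions
on $F$ and $g$, asymptotic expansions for $t$ and $y_\beta$:

$$t = 2n^{-1}\pi_1(z_\xi)\phi(z_\xi) + 2n^{-2}\pi_2(z_\xi)\phi(z_\xi) + \ldots$$

$$y_\beta = \theta + n^{-1/2}\sigma\{z_\beta + n^{-1/2}p_{11}(z_\beta) + n^{-1}p_{21}(z_\beta) + \ldots\}$$

for $0 < \beta < 1$.

In these expansions, the $\pi_j$'s are odd polynomials,
the $p_{j1}$'s are polynomials of degree at most $j + 1$ and are
odd for even $j$ and even for odd $j$, $\sigma^2$ is the asymptotic
variance of $n^{1/2}(\hat{\theta} - \theta)$, $\phi$ is the $N(0, 1)$ density and $z_\beta = \Phi^{-1}(\beta)$.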

In any given example these expansions are extremely
complicated. However, since $\sigma^2$ and the coefficients
of the polynomials depend only on moments of $F$, we
may easily establish the corresponding expansions for
the bootstrap versions $\hat{t}$ and $\hat{y}_\beta$ of $t$, $y_\beta$ respectively:

$$\hat{t} = 2n^{-1}\hat{\pi}_1(z_\xi)\phi(z_\xi) + 2n^{-2}\hat{\pi}_2(z_\xi)\phi(z_\xi) + \ldots$$

$$\hat{y}_\beta = \hat{\theta} + n^{-1/2}\hat{\sigma}\{z_\beta + n^{-1/2}\hat{p}_{11}(z_\beta) + n^{-1}\hat{p}_{21}(z_\beta) + \ldots\}$$

Here $\hat{\pi}_j$, $\hat{p}_{j1}$ and $\hat{\sigma}$ are obtained by substituting sample
moments for population moments in the expressions for
$\pi_j$, $p_{j1}$ and $\sigma$ respectively.

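Define

$$\tilde{t} = 2n^{-1}\hat{\pi}_1(z_\xi)\phi(z_\xi) + 2n^{-2}\hat{\pi}_2(z_\xi)\phi(z_\xi)$$

and

$$\tilde{y}_\beta = \hat{\theta} + n^{-1/2}\hat{\sigma}\{z_\beta + n^{-1/2}\hat{p}_{11}(z_\beta) + n^{-1}\hat{p}_{21}(z_\beta)\}.$$

Note that both these quantities may be calculated directly
from sample moments: no Monte Carlo approximation
is required.

We therefore arrive at two possible sample-based
asymptotic approximations to the iterated bootstrap
confidence interval:

$$I_2 = [\hat{y}_{1-\xi-\tilde{t}/2}, \hat{y}_{\xi+\tilde{t}/2}],$$

$$I_3 = [\tilde{y}_{1-\xi-\tilde{t}/2}, \tilde{y}_{\xi+\tilde{t}/2}].$$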

Key  comments  on  these  intervals,  introduced  by  Lee  and 
Young  ([6]),  are: 

(1) The interval $I_2$ still involves sample quantities $\hat{y}_\beta$
to be approximated by one level of bootstrap resampling.
But the inner level of sampling is avoided by
use of $\tilde{t}$, computed directly without sampling.

(2) In principle, the interval $I_3$ requires no resampling
at all. The procedure might, however, occasionally
require some form of adjustment, if, for example,
$\alpha + \tilde{t} \notin (0, 1)$ or $\tilde{y}_{1-\xi-\tilde{t}/2} > \tilde{y}_{\xi+\tilde{t}/2}$. In such cases we
suggest using $I_2$ instead.

(3) Construction of the intervals $I_2$ and $I_3$ is easily
packaged. The required computation needs to be
coded just once, for the general case. Application
then requires only specification of the formula $g$ for
the parameter $\theta$ of interest. The basis of an automatic
packaging is use of techniques of exact numerical
derivative evaluation. For details, see Lee and
Young ([6]). In particular, no symbolic computation
is required for practical use.

(4)  Through  study  of  a  range  of  problems,  it  would  ap¬ 
pear  that  asymptotic  calibration  gives  coverage  cor¬ 
rection  comparable  to  the  analytic  and  sequential 
approaches,  and  the  full-blown  iterated  bootstrap, 
at  a  fraction  of  the  computational  cost.  An  indica¬ 
tion  of  the  levels  of  computational  saving  is  given  in 
the  example  of  Section  6  below. 

(5)  The  asymptotic  calibration  requires  purely  arith¬ 
metic  computation,  and  computational  savings  are 
therefore  independent  of  sample  size  or  underlying 
distribution. 


(6)  Use  of  asymptotic  calibration  is,  however,  restricted, 
as  with  the  analytic  approach,  to  the  smooth  func¬ 
tion  model. 

6  Simulation  Study 

A simulation study has been carried out on the variance
example studied by Schenker ([7]) and DiCiccio, Martin
and Young ([2]). The parameter of interest $\theta$ is the
population variance and its estimate $\hat{\theta}$ is the (biased)
sample variance. The study compared the coverage accuracy
of the (uncorrected) percentile confidence interval
$I_0$ with the full-blown iterated bootstrap interval $I_1$,
approximated using two nested levels of resampling. Also
compared were the asymptotic intervals $I_2$ and $I_3$, the
sequential interval $I_s$ of Lee and Young ([5]) and the two
approximate intervals $I_{A1}$ and $I_{A2}$ described by DiCiccio,
Martin and Young ([2]) and DiCiccio, Martin and
Young ([3]) respectively.

Four different underlying distributions with various
degrees of skewness and kurtosis were used: the standard
normal $N(0, 1)$, with no skewness and no kurtosis,
the folded normal $|N(0, 1)|$, with high skewness and
low kurtosis, the double exponential of unit rate with no
skewness and high kurtosis, and finally, the log normal,
$\exp(N(0, 1))$, which has high skewness and high kurtosis.
The variances are respectively 1, $1 - 2/\pi$, 2 and $e(e - 1)$.
Three different sample sizes were taken: $n = 20$, 35 and
100 respectively. The full-blown iterated interval $I_1$ was
not constructed for $n = 100$ due to its immense computational
demands in this case.

The coverage probabilities of the various confidence
intervals were approximated from 1600 random samples,
so that each coverage figure has a standard error
of approximately 0.01. Intervals $I_0$, $I_2$, $I_{A1}$ and $I_{A2}$
were constructed using $B = 1000$ bootstrap resamples.
The full-blown iterated interval $I_1$ was constructed using
$C = 1000$ inner level bootstrap samples. The sequential
interval $I_s$ was constructed using $B = 1000$
outer level bootstrap samples: the inner level of sampling
was performed sequentially, subject to an upper
limit of $C = 1000$ bootstrap samples being drawn from
any given outer level sample. The analytic interval $I_{A1}$
generally requires no inner level resampling. However,
occasionally the iteration required by the analytic approximation
failed to converge. In these circumstances
the interval was constructed by the resampling method,
using $C = 1000$ inner level resamples. The interval $I_3$
generally requires no resampling. However, in the case of
erratic asymptotic interval end-points where, for example,
the lower limit exceeds the upper limit, the interval
$I_3$ was replaced by $I_2$.
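
The quoted standard error can be checked with the binomial formula for a proportion estimated from 1600 independent simulations. The short Python sketch below is an illustration only; the function name is hypothetical, and the value 0.833 is taken from Table 1 purely as an example of an observed coverage.

```python
import math


def coverage_standard_error(p, n_sims=1600):
    """Binomial standard error of a coverage proportion p estimated
    from n_sims independent simulated samples."""
    return math.sqrt(p * (1 - p) / n_sims)


worst_case = coverage_standard_error(0.5)   # largest possible value, 0.0125
typical = coverage_standard_error(0.833)    # slightly above 0.009
```

The worst case occurs at p = 0.5, so every coverage figure in the study has a standard error of at most 0.0125, in line with the approximate value of 0.01 stated above.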
Table 1: Estimated coverage probabilities for variance, based on 1,600 random samples of sizes n = 20, 35 and 100
drawn from each of four different distributions. I_1 is full-blown interval, I_0 is uncorrected percentile interval, I_s is
sequential interval, I_2 and I_3 are asymptotic intervals, I_A1 and I_A2 are saddlepoint-based analytic intervals.

Normal data N(0, 1) (no skew, no kurtosis)

Interval   coverage, n = 20   coverage, n = 35   coverage, n = 100
I_2        0.833              0.854              0.883
I_3        0.832 (0.161)      0.853 (0.014)      0.884 (0.000)
I_0        0.727              0.793              0.857
I_1        0.848              0.859              -
I_s        0.829 (190.2)      0.851 (166.8)      0.883 (143.1)
I_A1       0.820 (0.001)      0.843 (0.000)      0.879 (0.000)
I_A2       0.803              0.829              0.873

Folded normal data |N(0, 1)| (high skew, low kurtosis)

Interval   coverage, n = 20   coverage, n = 35   coverage, n = 100
I_2        0.803              0.821              0.874
I_3        0.800 (0.285)      0.819 (0.101)      0.880 (0.003)
I_0        0.686              0.753              0.843
I_1        0.815              0.834              -
I_s        0.793 (195.4)      0.823 (176.6)      0.876 (151.6)
I_A1       0.792 (0.024)      0.815 (0.003)      0.873 (0.002)
I_A2       0.778              0.798              0.860

Double exponential data (1/2) exp(-|x|) (no skew, high kurtosis)

Interval   coverage, n = 20   coverage, n = 35   coverage, n = 100
I_2        0.811              0.846              0.869
I_3        0.809 (0.304)      0.848 (0.118)      0.872 (0.013)
I_0        0.698              0.776              0.834
I_1        0.826              0.854              -
I_s        0.796 (202.4)      0.844 (182.4)      0.871 (157.1)
I_A1       0.803 (0.026)      0.840 (0.002)      0.869 (0.000)
I_A2       0.783              0.817              0.850

Log normal data exp{N(0, 1)} (high skew, high kurtosis)

Interval   coverage, n = 20   coverage, n = 35   coverage, n = 100
I_2        0.526              0.602              0.696
I_3        0.526 (0.533)      0.602 (0.393)      0.696 (0.216)
I_0        0.416              0.504              0.608
I_1        0.544              0.630              -
I_s        0.513 (218.7)      0.589 (207.7)      0.706 (190.6)
I_A1       0.529 (0.117)      0.610 (0.059)      0.699 (0.007)
I_A2       0.519              0.591              0.663
Table 2: Theoretical leading terms in asymptotic expansions of calibrating coefficient and coverage error corresponding
to the standard iterated bootstrap confidence interval $I_1$. Exponents that are illegible in the source are shown as "?".

True distribution                      Calibrating coefficient, t        Coverage error, $P(\theta \in I_1) - \alpha$
Standard normal, N(0, 1)               3.109 n^{-1}                      -1.499 x 10^{2} n^{-2}
Folded normal, |N(0, 1)|               6.498 n^{-1}                      -1.370 x 10^{?} n^{-2}
Double exponential, exp(-|x|)/2        1.206 x 10 n^{-1}                 -1.240 x 10^{?} n^{-2}
Log normal, exp(N(0, 1))               1.411 x 10^{?} n^{-1}             -2.488 x 10^{20} n^{-2}

The simulation results are reported in Table 1. For
the interval $I_3$, the proportion of simulations for which
the end-points were erratic is given in parentheses. This
proportion, though substantial for n = 20, diminishes
to negligible values for larger sample sizes, except in
the log normal case. For the interval $I_{A1}$, the proportion
of occasions when resampling was used instead of
the analytic approximation is given in parentheses; this
proportion also diminishes rapidly with n. The figure in
parentheses after each coverage value for the sequential
interval $I_s$ is the average number of second level bootstrap
samples drawn, to be compared with the fixed number
$C = 1000$ of the conventional interval $I_1$.

The results show very clearly the effect of iteration
on the coverage accuracy of the intervals. Overall, the
full-blown interval $I_1$ offers the best coverage accuracy,
though all the approximate intervals considered offer reasonable
approximations, in terms of coverage accuracy,
to that interval. As expected, the crude analytic interval
$I_{A2}$ displays discernibly poorer coverage accuracy than
the interval $I_{A1}$ it directly approximates.

Considering the sequential interval $I_s$, we note the
requirement of slightly fewer inner level resamples as the
sample size n increases. Also, the number of sequential
resamples depends slightly on the underlying distribution.
Nevertheless, we observe that the computational
savings due to drawing the inner level resamples sequentially
are not much affected by the underlying distribution,
compared to the other intervals considered.

Without giving full timing comparisons, we note that
the computational savings through use of the asymptotic
interval $I_3$ depend on the proportion of times that
adjustment of that interval is required. Relative to $I_2$,
the interval $I_3$ is most computationally advantageous for
larger n and normal-type underlying populations. Use
of $I_3$ can reduce computation relative to $I_2$ by as little
as a factor of 2, for n = 20 in the log-normal case, or
by as much as 150 or so, for n = 100 and a normal distribution.
Compared to the sequential interval $I_s$, use of
$I_3$ reduces computation by a factor of at least 250 for
n = 20; for n = 100 this factor increases dramatically to
around 15000, except for the log-normal case where the
factor remains of the order 400. The sequential interval
$I_s$ requires about 3 times the amount of computation of
the analytic interval $I_{A2}$, uniformly over the cases considered
in the simulation. As anticipated, the computational savings
through use of the analytic interval $I_{A1}$ are very variable.
Relative to $I_s$, which we have already noted provides fairly uniform
savings, requiring about 1/5 of the computation of the
full-blown interval $I_1$, $I_{A1}$ can vary from requiring about
twice as much computation to requiring only about 1/3
as much computation, depending on the sample size and
underlying distribution.

It is to be noted that, even with iteration, the coverage
error is often very large, especially for the log normal
underlying distribution. To illustrate further the impact
on coverage error of different distributions, we have computed
the theoretical leading terms of the expansions of
the calibrating coefficient t and of the coverage error, for
the theoretical iterated bootstrap confidence interval $I_1$.
Note that all the iterated intervals considered here have
coverage error of order $O(n^{-2})$, while the uncorrected
interval $I_0$ has coverage error of order $O(n^{-1})$. Results
are listed in Table 2. We can readily appreciate why the
log normal distribution yields large coverage error, and
why the bootstrap iteration has relatively little success
in eliminating coverage error in this case.

In terms of coverage accuracy, the asymptotic calibration
proves very effective, and is also by far the best of
the intervals considered in terms of computational speed.
The interval $I_3$ generally provides worthwhile computational
savings over $I_2$. The interval $I_3$ is perhaps, therefore,
to be favoured overall. In the variance example
considered here, use of the asymptotic interval $I_3$ reduces
computation by a factor in the thousands, compared to $I_1$,
whatever the sample size or parent population.
Abstract

Two effective variance reduction techniques for estimating
probabilities and quantiles in the tails of bootstrap
distributions, importance sampling and concomitants
of order statistics, are based on linear approximations.
Although these techniques offer potential variance
reductions by factors from nine to infinity, in practice the
reductions may be only by a factor of two or smaller,
because of inaccurate linear approximations.

We develop tail-specific linear approximations that are
more accurate where the accuracy is important, in the
tails of distributions. Our methods fall into two categories:
influence function methods and regression methods.
Both can be applied without problem-specific analytical
calculations, and both have tail-specific versions.

We apply the tail-specific approximations to importance
sampling and concomitants, and propose another
technique that uses linear approximations,
post-stratification implemented using the saddlepoint. This
technique shares the same variance as the
concomitants procedure.

Keywords:  Concomitants  of  order  statistics,  stratified 
sampling,  empirical  influence  function,  importance  sam¬ 
pling,  jackknife,  variance  reduction. 

1  Introduction 

This article concerns more efficient computational methods
for estimating tail probabilities and percentiles of
bootstrap distributions. The primary focus of this article
is the development of tail-specific linear approximations
for bootstrap statistics. Such approximations
do not stand on their own, but allow more effective use
of other methods such as importance sampling (Johns
1988, Davison 1988) and concomitants of order statistics
(Efron 1990 section 5, Do and Hall 1992). We also
propose another method, post-stratification using the
saddlepoint.


We concentrate on the nonparametric bootstrap;
see, e.g., Efron (1982, 1987) and Efron and Tibshirani
(1993) for further discussion. The original data is
$X = (x_1, x_2, \ldots, x_n)$, a sample from an unknown
distribution (which may be multivariate). Let
$X^* = (X_1^*, X_2^*, \ldots, X_n^*)$ be a "bootstrap" sample of size n
chosen with replacement from X. We wish to estimate tail
probabilities or quantiles for $T^* = T(X^*)$, which may
be a parameter estimate or a pivotal statistic used for
inferences.

Let $G(a) = P(T^* \le a)$ be the bootstrap distribution
function. The simple Monte Carlo estimate of
G requires some large number B of bootstrap samples
$X_b^*$ for $b = 1, \ldots, B$; then the estimate is
$\hat G(a) = (1/B) \sum_{b=1}^{B} I(T_b^* \le a)$, where I is the usual
indicator function and $T_b^* = T(X_b^*)$ for each such sample.
Furthermore, some techniques such as the "iterated
bootstrap" (Beran 1987) require that some number $B_2$ of
bootstrap samples be generated from each of the original
B bootstrap samples, requiring a total of $B + B B_2$ bootstrap
samples. For techniques that require tail probability
or quantile estimates the total number of bootstrap
samples required can be quite large. For example, Efron
(1987) finds that $B = 1000$ observations are required to
adequately estimate tail quantiles for his (non-iterated)
confidence intervals, and Booth and Hall (1992) find that
$B_2 = K B^{1/2}$ is approximately optimal for some constant
K that depends on the desired coverage level, with
$K = 0.57$ for a two-sided 95% confidence interval; this
results in a total of about 19,000 bootstrap replications.
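
The figure of about 19,000 replications can be checked with a few lines of arithmetic. The sketch below assumes the rule $B_2 = K B^{1/2}$ with $B_2$ rounded to the nearest integer; the function name is illustrative only.

```python
import math

def total_bootstrap_samples(B, K):
    """Return (B2, total) for an iterated bootstrap with B outer samples.

    B2 = K * sqrt(B) inner samples are drawn from each outer sample,
    so the total number of bootstrap samples is B + B * B2.
    """
    B2 = round(K * math.sqrt(B))
    return B2, B + B * B2

B2, total = total_bootstrap_samples(B=1000, K=0.57)
print(B2, total)
```

With B = 1000 and K = 0.57 this gives 18 inner samples per outer sample, hence 1000 + 18,000 = 19,000 samples in all.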

The three variance reduction techniques (importance
sampling, concomitants, and post-stratification) can reduce
the computational burden substantially, but all require
accurate linear approximations to $T^*$ in order to
work well. For example, Do and Hall (1992) show that
the concomitants procedure gives variance reductions
that approach infinity, asymptotically, because their linear
approximations become more accurate as n increases.
But they note that the procedure does not do well
when a statistic is markedly non-linear. Similarly, Efron
(1990) reports variance reductions by factors of only 1.8
for the lower 2.5%-tile and 0.49 (half as efficient as simple
random sampling) for the upper 97.5%-tile of the
law school data of Efron (1992, section 2.5). Do and
Hall (1992) further note that the procedure does better
in the center of a distribution than in the tails. We show
that the usual linear approximations are more accurate
in the center of a bootstrap distribution than in the tails.

Our primary result in this article is the development
of tail-specific (actually quantile-specific) linear approximations,
which are more accurate near their design
quantiles. Such "local" accuracy is more important to
the performance of variance reduction techniques than
is overall accuracy.

We begin with a general discussion of linear approximations
in section 2. We discuss linear approximations
related to the empirical influence function (Efron 1982)
in section 3, and approximations based on regression in
section 4. In section 5 we return to a topic raised in section 2,
that transforming $T^*$ may improve linearity, and
discuss how to choose a transformation. In section 6 we
review the variance reduction techniques which can use
the linear approximations, and show how sensitive these
techniques are to the quality of the linear approximations.
We propose a new technique, post-stratification
using the saddlepoint probability estimate.

2  Linear  approximations 

A "curvilinear approximation" to $T^*$ is determined by a
vector L of length n, with elements $L_j$ corresponding to
each of the original observations $x_j$, such that

$$\psi(T(X^*)) \approx \sum_{j=1}^{n} L_j P_j^* \qquad (2.1)$$

where $\psi$ is a smooth monotone increasing function, $P_j^* = M_j^*/n$,
and $M_j^*$ is the number of times $x_j$ is included in
$X^*$. For later use, define $L^*$ to be the right hand side of
(2.1), and let $L^*_\alpha$, $T^*_\alpha$, and $z_\alpha$ be the true $\alpha$ quantiles
of the distributions of $L^*$, $T^*$, and the standard normal
distribution, respectively.

For example, consider the usual t-statistic

$$T(X^*) = n^{1/2} (\bar X^* - \bar x) / s_{X^*}, \qquad (2.2)$$

where $\bar X^*$ is the sample average and $s^2_{X^*}$ is the sample
variance of a bootstrap sample and $\bar x$ is the sample average
of the original data. We use as X the data of
Graham et al. (1990): (9.6, 10.4, 13.0, 15.0, 16.6, 17.2,
17.3, 21.8, 24.0, 26.9, 33.8), for which $\bar x = 18.7$. Fixing
the denominator of (2.2) at 7.3, the sample standard
deviation of the original sample, results in a "central"
approximation with $L_{\text{central},j} = n^{1/2}(x_j - 18.7)/7.3$ and
$L^*_{\text{central}} = n^{1/2}(\bar X^* - 18.7)/7.3$. The first panel of Figure
1 shows a scatterplot of $T^*$ vs. this $L^*$, for 1500 bootstrap
samples. The approximation is very accurate near
$L^* = 0$, with very little scatter either above or below the
line $T^* = L^*$, but is worse for $L^*$ farther from zero.
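
The behaviour just described can be reproduced with a small simulation. The following is only a sketch under stated assumptions: it uses the data quoted above, the exact sample mean and standard deviation rather than the rounded 18.7 and 7.3, an arbitrary random seed, and an ad hoc split into "centre" (|L*| < 0.5) and "tails" (|L*| >= 1.5) that is not taken from the article.

```python
import math
import random

# Data of Graham et al. (1990), as quoted in the text.
x = [9.6, 10.4, 13.0, 15.0, 16.6, 17.2, 17.3, 21.8, 24.0, 26.9, 33.8]
n = len(x)
xbar = sum(x) / n
s = math.sqrt(sum((v - xbar) ** 2 for v in x) / (n - 1))

# Central linear approximation: denominator of (2.2) fixed at s.
L_central = [math.sqrt(n) * (v - xbar) / s for v in x]

def t_stat(sample):
    """Bootstrap t-statistic (2.2) for one resample."""
    m = sum(sample) / n
    sd = math.sqrt(sum((v - m) ** 2 for v in sample) / (n - 1))
    return math.sqrt(n) * (m - xbar) / sd

rng = random.Random(0)            # arbitrary seed, for reproducibility only
pairs = []                        # (L*, T*) for each bootstrap sample
while len(pairs) < 1500:
    idx = [rng.randrange(n) for _ in range(n)]
    if len(set(idx)) == 1:        # skip degenerate resamples (zero variance)
        continue
    L_star = sum(L_central[i] for i in idx) / n   # sum_j L_j P_j*
    pairs.append((L_star, t_stat([x[i] for i in idx])))

def mean_abs_error(lo, hi):
    """Average |T* - L*| over samples with lo <= |L*| < hi."""
    errs = [abs(t - l) for l, t in pairs if lo <= abs(l) < hi]
    return sum(errs) / len(errs)

center_err = mean_abs_error(0.0, 0.5)
tail_err = mean_abs_error(1.5, math.inf)
print(f"mean |T* - L*|: centre {center_err:.3f}, tails {tail_err:.3f}")
```

The approximation error should come out noticeably larger in the tails than in the centre, which is the pattern the scatterplot is said to show.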

The increasing conditional variability of $T^*$ for $L^*$ farther
from 0 motivates the central theme of this article.
We call the first linear approximation a central approximation
because it is accurate in the center of the bootstrap
distribution. Some other linear approximation may
be more accurate elsewhere. Indeed, the approximation
defined by $L_{\text{right},j} = -15.4 + 1.02 x_j - 0.014 x_j^2$ is more
accurate in the right tail, as shown in the second panel of
Figure 1. We obtain this approximation using the right-tail
influence function method in the next section; the
central approximation is also equivalent to the central
influence function approximation.

Furthermore, the relationship between $T^*$ and $L^*$ is
nonlinear, and both versions of $L^*$ are better approximations
to some transformation $\psi(T^*)$ than they are to
$T^*$ itself. We discuss estimation of $\psi$ in section 5, and
applications of that estimate; but estimating $\psi$ requires
a linear approximation, and that is where we turn now.

3  Influence  Function  and  Knife 
Approximations 

We begin in this section by describing statistics T for
which the linear approximation methods in this section
are defined, then proceed to describe the general class
of approximations and the specific approximations. The
approximations differ in two regards: whether they are
central or tail-specific approximations, and the choice
of one parameter, which in turn determines whether the
approximation is implemented analytically or numerically,
and also determines whether the approximation
is suitable for non-smooth or only smooth functions.

We begin by writing $T(P^*) = T(X^*)$, where $P^* =
(P_1^*, \ldots, P_n^*)$; in other words, any bootstrap sample may
be viewed as an empirical distribution with weight $P_j^*$ on
original observation $x_j$. In this section we require that
T be defined for all weight vectors P with nonnegative
weights summing to 1, not just those which are realizable
as bootstrap samples, i.e. those whose coordinates are
integers/n. We say that such T are defined for weighted
samples.

For example, we may rewrite the t-statistic (2.2) as

$$T(P) = (n - 1)^{1/2} (\bar x_P - \bar x) / \hat\sigma_P$$

where $\bar x_P = \sum_j P_j x_j$ is the weighted average and
$\hat\sigma_P = (\sum_j P_j (x_j - \bar x_P)^2)^{1/2}$ is the weighted standard deviation
of a sample. Other examples of statistics which are
defined for weighted samples include functions of sample
moments (such as means, variances, regression coefficients,
and the usual (Pearson) bivariate correlation)
and statistics defined by estimating equations (such as
M-estimates of location). Functions such as Spearman's
correlation, which is a function of the ranks of observations,
are not defined for weighted samples, and the
regression methods in section 4 should be used for such
functions.

[Figure 1: Central and right-tail influence function linear approximations for the t-statistic. Panels: Central Linear Approximation; Right Linear Approximation.]

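The linear approximations in this section are of the
form

$$L_j = \frac{T(P_c + \epsilon(\delta_j - P_c)) - T(P_c)}{\epsilon} \qquad (3.1)$$

for some point $P_c$ and some $\epsilon$, where $\delta_j$ denotes the weight
vector placing all its mass on observation $x_j$. These are Taylor-series
or finite-difference approximations to the gradient of the
function $T(P)$; the approximations differ in the choice of
$P_c$ and the choice of $\epsilon$.

Central linear approximations use $P_c = P_0 =
(1, 1, \ldots, 1)/n$, which corresponds to the original data, and
one of four choices of $\epsilon$:

$$\begin{aligned}
L_{\text{negative jackknife}} &: \ \epsilon = -1/(n - 1) \\
L_{\text{influence function}} &: \ \epsilon \to 0 \\
L_{\text{positive jackknife}} &: \ \epsilon = 1/(n + 1) \\
L_{\text{butcher knife}} &: \ \epsilon = n^{-1/2}
\end{aligned} \qquad (3.2)$$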

The first three are the negative jackknife, influence
function (or infinitesimal jackknife), and positive jackknife
approximations of Efron (1982). The butcher
knife (large jackknife) is motivated by the observation
of Efron (1982) that the jackknife uses T evaluated at
points which are very close to $P_0$, with a squared distance
of $\|P - P_0\|^2 = 1/(n(n - 1))$, whereas $E[\|P^* - P_0\|^2] =
(n - 1)/n^2$ under simple random (bootstrap) sampling.
The butcher knife matches the expected squared distance,
and so may be a more accurate approximation
to the bootstrap distribution.
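
As an illustration of the four central choices, the sketch below evaluates the finite-difference form of the approximation for the weighted t-statistic on the Graham et al. data. Several things here are assumptions rather than statements from the article: the perturbed point is taken to be $(1 - \epsilon) P_c + \epsilon \delta_j$, the influence function is approximated with a very small $\epsilon$ (1e-6), and the butcher knife uses $\epsilon = n^{-1/2}$, the value implied by matching the expected squared distance $(n - 1)/n^2$.

```python
import math

x = [9.6, 10.4, 13.0, 15.0, 16.6, 17.2, 17.3, 21.8, 24.0, 26.9, 33.8]
n = len(x)
xbar = sum(x) / n

def T(P):
    """Weighted-sample t-statistic: (n-1)^(1/2) (xbar_P - xbar) / sigma_P."""
    m = sum(p * v for p, v in zip(P, x))
    sd = math.sqrt(sum(p * (v - m) ** 2 for p, v in zip(P, x)))
    return math.sqrt(n - 1) * (m - xbar) / sd

def knife(eps, Pc):
    """Finite-difference L_j = (T(Pc + eps (delta_j - Pc)) - T(Pc)) / eps."""
    base = T(Pc)
    L = []
    for j in range(n):
        P = [(1 - eps) * p for p in Pc]
        P[j] += eps                      # move mass eps towards observation j
        L.append((T(P) - base) / eps)
    return L

def sq_dist(eps):
    """Squared distance ||eps (delta_j - P0)||^2 = eps^2 (n - 1) / n."""
    return eps ** 2 * ((1 - 1 / n) ** 2 + (n - 1) / n ** 2)

P0 = [1 / n] * n
L_neg = knife(-1 / (n - 1), P0)          # negative jackknife (leave one out)
L_inf = knife(1e-6, P0)                  # numerical stand-in for eps -> 0
L_pos = knife(1 / (n + 1), P0)           # positive jackknife (duplicate one)
L_but = knife(n ** -0.5, P0)             # butcher knife (assumed eps)

for name, L in [("neg", L_neg), ("inf", L_inf), ("pos", L_pos), ("but", L_but)]:
    print(name, [round(v, 2) for v in L])
```

For this statistic the small-$\epsilon$ values should agree with the central approximation $n^{1/2}(x_j - \bar x)/s$ of section 2, which gives a quick sanity check on the implementation.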

3.1  Tail-specific  methods 

Tail-specific linear approximations are also defined using
(3.1), using the same choices of $\epsilon$ (3.2), but with
the Taylor-series or finite-difference approximation performed
about a different initial point $P_c = P_\alpha$. We
choose $P_\alpha$ so that $T(P_\alpha) \approx T^*_\alpha$, and so that otherwise
$P_\alpha$ is as close to $P_0$ as possible.

If n is very large, a suitable initial point is

$$P_\alpha = P_0 + c L_{\text{central}}, \qquad (3.3)$$

where $L_{\text{central}}$ is a central linear approximation (normalized
to sum to 0) and $c = z_\alpha n^{-1} (\sum_{j=1}^{n} L_{\text{central},j}^2)^{-1/2}$.
If n is not large, or if $L_{\text{central}}$ is skewed, we recommend
instead the use of exponential tilting,

$$P_\alpha = P[\tau], \quad \text{where } P_j[\tau] = k \exp(\tau L_{\text{central},j}), \qquad (3.4)$$

and where k is a normalizing constant and $\tau$ solves
$L_{\text{central}} \cdot P_\alpha = L^*_\alpha$, with $L^*_\alpha$ estimated from a normal,
Cornish-Fisher, or saddlepoint approximation. The
right panel of Figure 1 shows the right-tail ($\alpha = 0.975$)
influence function estimate, using exponential tilting.
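
The tilting step can be sketched numerically as follows. This is an illustration under assumptions, not the article's procedure in detail: the target quantile of $L^*$ is taken from the normal approximation (the simplest of the three options mentioned), the tilting parameter is found by bisection on an arbitrarily chosen bracket, and the function names are hypothetical.

```python
import math
from statistics import NormalDist

x = [9.6, 10.4, 13.0, 15.0, 16.6, 17.2, 17.3, 21.8, 24.0, 26.9, 33.8]
n = len(x)
xbar = sum(x) / n
s = math.sqrt(sum((v - xbar) ** 2 for v in x) / (n - 1))
L = [math.sqrt(n) * (v - xbar) / s for v in x]   # central approximation

def tilt(L, tau):
    """Exponentially tilted weights P_j[tau] = k exp(tau L_j)."""
    m = max(tau * l for l in L)                  # subtract max for stability
    w = [math.exp(tau * l - m) for l in L]
    k = 1 / sum(w)                               # normalizing constant
    return [k * wi for wi in w]

def solve_tau(L, target, lo=-20.0, hi=20.0, iters=200):
    """Bisection for tau such that sum_j L_j P_j[tau] = target.

    The tilted mean is increasing in tau, so bisection is sufficient;
    the bracket [-20, 20] is an assumption that suits this data set.
    """
    for _ in range(iters):
        mid = (lo + hi) / 2
        mean = sum(l * p for l, p in zip(L, tilt(L, mid)))
        if mean < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

alpha = 0.975
sd_L = math.sqrt(sum(l * l for l in L)) / n      # bootstrap sd of L*
target = NormalDist().inv_cdf(alpha) * sd_L      # normal estimate of the quantile
tau = solve_tau(L, target)
P_alpha = tilt(L, tau)
print(round(tau, 4), [round(p, 3) for p in P_alpha])
```

The resulting weights shift mass towards the larger observations, and $P_\alpha$ can then be used as the expansion point $P_c$ for a right-tail approximation.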

3.2  Comparing  influence  function  and 
knife  methods 

The four choices of $\epsilon$ in (3.2) result in approximations
which differ in implementation details, in the kind of
problem where they may be used, and in accuracy.

The  knife  approximations  are  evaluated  numerically. 
They  have  the  advantage  that  they  do  not  require  ana¬ 
lytical  calculations,  but  the  disadvantage  of  requiring  n 
evaluations  of  T. 

The influence function L is the gradient of the function
$T(P)$ at $P_c$, and is only suitable for statistics which
are smooth (continuous and differentiable) functions of
P; similarly for the jackknife versions, which are
finite-difference approximations to the gradient. For a
discontinuous function such as the sample median the influence
function estimate is undefined if n is even and has
$L_j = 0$ for all j if n is odd; the jackknife methods may
have $L_j = 0$ for all j if there are repeated observations
at the median. The butcher knife is a finite-difference
method, but evaluated at points farther from $P_c$, and
can be used for non-smooth functions.

For statistics which are smooth functions of P there
are subtle but significant differences between the four
methods. The correlations between $T^*$ and the negative
jackknife, influence function, positive jackknife, and
butcher knife linear approximations, respectively, are
0.942, 0.955, 0.957, and 0.955, for our t-statistic example.
The correlations between a nonparametric estimate
$\hat\psi(T^*)$ and the approximations are higher (correlation
.987 with the influence function approximation), but follow
a similar pattern: only the negative jackknife does
appreciably worse than the others. But even though the
butcher knife approximation is as good overall as the influence
function approximation, it is not quite as good
"locally", for values of $L^*$ near the target $L \cdot P_c$. This difference
arises from the way the approximations are defined:
the influence function is determined by a local
approximation to T, the butcher knife by a more global
approximation. Local accuracy (in the tails) is more
important for the variance reduction techniques than is
global accuracy, so we recommend the influence function
(or positive jackknife approximation) for smooth statistics.

4  Regression  approximations 

Linear  regression  methods  may  be  used  to  obtain  linear 
approximations  for  any  statistic,  even  those  not  defined 
for  weighted  samples.  We  begin  with  central  regression 
methods,  and  follow  with  tail-specific  methods. 

Let $M^*_{bj}$ be the number of times original observation
$x_j$ is included in the $b$th sample and let $P^*_{bj} = M^*_{bj}/n$.
Run a linear regression without an intercept of the form

$$T_b^* = \sum_{j=1}^{n} \beta_j P^*_{bj} + \text{residual}_b, \qquad (4.1)$$

and let

$$L_{\text{central},j} = \hat\beta_j - \bar\beta, \qquad (4.2)$$

where $\bar\beta = n^{-1} \sum_{j=1}^{n} \hat\beta_j$. The intercept is omitted because
otherwise the regression would be singular. This
linear approximation was obtained by Efron (1990).
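
A minimal sketch of the central regression approximation is given below, again for the t-statistic and the Graham et al. data. The number of bootstrap samples (500), the seed, and the small Gaussian-elimination solver for the normal equations are illustrative choices made here to keep the example dependency-free; they are not taken from the article.

```python
import math
import random

x = [9.6, 10.4, 13.0, 15.0, 16.6, 17.2, 17.3, 21.8, 24.0, 26.9, 33.8]
n = len(x)
xbar = sum(x) / n

def t_stat(sample):
    m = sum(sample) / n
    sd = math.sqrt(sum((v - m) ** 2 for v in sample) / (n - 1))
    return math.sqrt(n) * (m - xbar) / sd

def solve(A, y):
    """Solve A z = y by Gaussian elimination with partial pivoting."""
    m = len(y)
    M = [row[:] + [y[i]] for i, row in enumerate(A)]
    for c in range(m):
        p = max(range(c, m), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, m):
            f = M[r][c] / M[c][c]
            for k in range(c, m + 1):
                M[r][k] -= f * M[c][k]
    z = [0.0] * m
    for c in reversed(range(m)):
        z[c] = (M[c][m] - sum(M[c][k] * z[k] for k in range(c + 1, m))) / M[c][c]
    return z

rng = random.Random(1)                 # arbitrary seed
B = 500
P, T_star = [], []                     # rows P*_b and responses T*_b
while len(T_star) < B:
    idx = [rng.randrange(n) for _ in range(n)]
    if len(set(idx)) == 1:             # skip degenerate resamples
        continue
    counts = [0] * n
    for i in idx:
        counts[i] += 1
    P.append([c / n for c in counts])
    T_star.append(t_stat([x[i] for i in idx]))

# Normal equations (P'P) beta = P'T for the no-intercept regression (4.1).
PtP = [[sum(P[b][i] * P[b][j] for b in range(B)) for j in range(n)] for i in range(n)]
PtT = [sum(P[b][j] * T_star[b] for b in range(B)) for j in range(n)]
beta = solve(PtP, PtT)

beta_bar = sum(beta) / n
L_reg = [bj - beta_bar for bj in beta]     # centred coefficients, as in (4.2)
print([round(v, 2) for v in L_reg])
```

Rerunning with different seeds gives a feel for the sampling variability of the elements of L when the number of bootstrap samples is modest.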

Do and Hall (1992) propose an alternative which is an
approximation to least-squares estimation in (4.1), but
suffers from greater sampling variability in the terms of
L, as shown in Figure 2. That sampling variability translates
into less accurate curvilinear relationships between
$T^*$ and $L^*$.

4.1  Tail-specific  regression  approxima¬ 
tions 

We considered a variety of local and global regression
methods for tail-specific approximations. The local
methods did not work well, suffering from excessive sampling
variability, because they are based on only a small
fraction of all bootstrap samples (e.g. the $2\alpha B_0$ replications
with the largest values of $T^*$). Global regression
methods suffer from other problems, but those can be
fixed.

Our global regression principle is to fit a nonlinear
surface using all the bootstrap samples, and use the gradient
of that surface at an appropriate point in the tail.
However, fitting a quadratic relationship between $T^*$ and
$P^*$ requires estimating $n - 1$ first derivatives (P is of
rank $n - 1$) and $n(n - 1)/2$ second derivatives, which
may be too many coefficients to estimate with a modest
bootstrap sample size. Our solution is to fit a restricted
quadratic surface, estimating only the linear combination
of second derivatives which affects the gradient
at the tail evaluation point. Details of the restricted
quadratic fitting are available from the author. It
works moderately well, but suffers from sampling variability.
Estimating the nonlinear transformation $\psi$ and
fitting $\hat\psi(T^*)$ (instead of $T^*$) as a restricted quadratic
function of $P^*$ reduces the sampling variability considerably,
with improvement comparable to using the full
regression rather than the approximation in Figure 2.

[Figure 2: Central Regression L for the t-statistic and an approximation. 10 vectors L, each estimated from 500 bootstrap samples. Horizontal axis in both panels: Influence Function L.]

5  Transformations  of  T* 

We have nearly concluded our discussion of linear approximations,
and are about ready to turn to applications.
But first we turn to a topic which applies to both
the approximations and applications, that of estimating
the nonlinear transformation $\psi$ in (2.1). The
butcher knife and regression approximations (and to a
lesser extent the jackknife approximations) are not invariant
under nonlinear transformations, so we can improve
those estimates by replacing T with an estimate
$\hat\psi(T)$. And the concomitants technique below is not
equivariant under nonlinear transformations, and can be
improved considerably using an estimate of $\psi$. A third
use for a transformation is for improving the bootstrap-t
confidence interval; see Tibshirani (1988).

We propose two ways to estimate $\psi$, one deterministic,
the other based on bootstrap observations. The former
requires extra evaluations of T, while the latter gives
estimates which are subject to sampling variability.

The deterministic procedure is to estimate $\psi$ by interpolating
$l = P \cdot L$ (as y) against $T(P)$ (as x), for
points P determined using exponential tilting (3.4); every
distinct value of the tilting parameter $\tau$ results in
one training point for the interpolation.

The nondeterministic procedure is to estimate $\psi$ using
a scatterplot smooth or nonlinear regression of $L^*$ (as y)
against $T^*$, for $B_0$ bootstrap samples. This is motivated
by the ACE algorithm of Breiman and Friedman (1985),
and is nearly invariant under $\psi$, subject to limitations
of the smoothing method. The estimate should then be
rescaled so that $\hat\psi'(T(P_0)) = 1$.

Both  the  deterministic  and  nondeterministic  transfor¬ 
mations  nicely  remove  the  curvilinearity  observed  in  Fig¬ 
ure  1,  and  give  much  better  regression  linear  approxi¬ 
mations,  with  sampling  variability  in  the  elements  of  L 
smaller  by  factors  of  approximately  four  and  six  for  the 
central  and  tail-specific  approximations,  respectively. 

6  Variance  Reduction 

We have three purposes in this section: to justify the
effort put into accurate linear approximations by showing
how sensitive variance reduction techniques are to
that accuracy, to indicate when tail-specific approximations
will give a significant improvement, and to propose
a new variance reduction technique that uses linear
approximations.

Figure  3  shows  the  relative  efficiency  for  the  three  vari¬ 
ance  reduction  techniques  as  a  function  of  the  correlation 
between  T*  and  L* .  Note  how  quickly  the  concomitants 
and  stratified  sampling  procedures  lose  efficiency  as  the 
correlation  drops.  This  figure  indicates  why  more  accu¬ 
rate  linear  approximations  are  worth  pursuing. 

The  “efficiency”  in  Figure  3  and  later  is  the  efficiency 
of  a  technique  relative  to  simple  bootstrap  sampling,  for 
estimating  tail  probabilities  corresponding  to  the  0.025 
and  0.975  percentiles  of  the  distribution  of  T*.  See  Hall 
(1991)  and  Johns  (1988)  for  asymptotic  results  in  special 
cases  that  indicate  that  this  efficiency  is  asymptotically 
equivalent  to  the  efficiency  for  estimating  the  percentiles 
themselves. 

We assume in Figure 3 that the relationship between
$T^*$ and $L^*$ is linear, that $L^*$ is a central approximation,
and that the distribution of $L^*$ is normal, but do not
assume that the joint distribution is bivariate normal;
it was very clearly not in Figure 1. Instead, assuming
that $T^*$ can be written as a smooth function of sample
means, we find that the joint distribution is of the form

$$\psi(T^*) = L^* + \sum_{d=1}^{D} \left\{ a_d (L^* - L_0) Z_d + b_d (Z_d^2 - 1) \right\} \qquad (6.1)$$

for some $\psi$, where $(L^* - L_0) = \ldots$, the $Z_d$ are
independent standard normal random variables, $L_0$ depends
on the linear approximation, and D is a small
integer that depends on the statistic, not on n. We
omit the formal statement and proof of this theorem in
this version of this article. It turns out that tail-specific
linear approximations work best when the a's are large
compared to the b's, and it is clear from (6.1) that the
conditional variance of $\psi(T^*)$ given $L^*$ is smallest in the
center when any of the a's are nonzero. So a scatterplot
of $T^*$ against $L^*$ can be used to diagnose if a tail-specific
approximation may be useful, without actually computing
it; a small conditional variance in the center, as in
Figure 1, indicates that a tail-specific approximation can
help.

The two panels in Figure 3 correspond to the "heteroskedastic
normal" case where $b_d = 0$ for $d = 1, \ldots, D$,
and the "single-$\chi^2$" case where $b_1 \ne 0$ and the other
a's and b's are zero. In the heteroskedastic normal case
the conditional distribution of $\psi(T^*)$ given $L^*$ is normal
with standard deviation proportional to $|L^* - L_0|$; we see
similar behavior in Figure 1. The heteroskedastic normal
case is interesting because it offers the greatest potential
for tail-specific linear approximations, and because
without tail-specific approximations it is a hard case for
the concomitants and stratified sampling procedures.

The single-$\chi^2$ case is also interesting because it has
the heaviest tails of all conditional distributions of $\psi(T^*)$
given $L^*$, for fixed $\rho$ and distributions in the family (6.1).
That those heavy tails are a problem for importance sampling
is apparent in the right panel of Figure 3. Unfortunately,
tail-specific approximations offer little help here,
but the damage can be mitigated by using more conservative
importance sampling.

6.1  Concomitants  of  order  statistics 

For simplicity of notation, sort the bootstrap samples by
the values of $L_b^*$. Then the concomitants estimate of the
bootstrap distribution is
$\hat G(a) = B^{-1} \sum_{b=1}^{B} I(\tilde T_b \le a)$,
where

$$\tilde T_b = T_b^* + \hat L_b - L_b^* \qquad (6.2)$$

and $\hat L_b$ is an estimate of the $(b - 0.5)/B$ quantile of the
distribution of $L^*$. Efron (1990) lets $\hat L_b$ be the $b$th
normal score $\Phi^{-1}((b - 0.5)/B)$, iteratively transformed
using cubic Cornish-Fisher transformations so that the first
four sample moments match the theoretical moments of
$L^*$, but suggests that letting $\hat L_b$ be the saddlepoint
estimate of that quantile would be more accurate.
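
The adjustment in (6.2) can be sketched in a few lines. This is only an illustrative sketch: the function name `concomitants_cdf` and the argument `l_quantile` (a caller-supplied quantile function for $L^*$, standing in for the Cornish-Fisher or saddlepoint estimate discussed above) are assumptions, not part of the original procedure.

```python
from statistics import NormalDist


def concomitants_cdf(t_star, l_star, a, l_quantile):
    """Concomitants estimate of P(T* <= a).

    t_star, l_star: bootstrap values of the statistic and of its
        linear approximation (same length B, same order).
    l_quantile: hypothetical helper, a function p -> estimated
        p-quantile of the distribution of L*.
    """
    B = len(t_star)
    # Sort the bootstrap samples by the value of L*.
    order = sorted(range(B), key=lambda b: l_star[b])
    count = 0
    for rank, b in enumerate(order, start=1):
        l_hat = l_quantile((rank - 0.5) / B)
        # Equation (6.2): replace the observed L* by its quantile estimate.
        t_tilde = t_star[b] + l_hat - l_star[b]
        if t_tilde <= a:
            count += 1
    return count / B


if __name__ == "__main__":
    import random

    rng = random.Random(0)
    l = [rng.gauss(0.0, 1.0) for _ in range(200)]
    t = [li + 0.1 * rng.gauss(0.0, 1.0) for li in l]
    print(concomitants_cdf(t, l, -1.96, NormalDist().inv_cdf))
```

When $T^* = L^*$ exactly, every adjusted value equals its quantile estimate, so the estimate no longer depends on the random draws at all; this is the sense in which the linear part of the simulation error is removed.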

In the case that $L^*$ and $T^*$ are jointly continuous we
find the asymptotic variance of the concomitants estimate
to be

$$B\,\mathrm{Var}(\hat G(a)) = \int H(a - l \mid l)\,(1 - H(a - l \mid l))\, f(l)\, dl
  + 2 \iint H'(a - l_1 \mid l_1)\, H'(a - l_2 \mid l_2)\, F(l_1)\,(1 - F(l_2))\, dl_1\, dl_2 \qquad (6.3)$$

where $F$ and $f$ are the distribution and density functions
of $L^*$, $H(a \mid l) = P(T^* \le a \mid L^* = l)$, and
$H'(a \mid l) = \ldots$. Note, in Figure 3, how strongly the efficiency
depends on the correlation between $L^*$ and $T^*$.

The concomitants procedure is not invariant under
nonlinear transformations of $T^*$. If $L^*$ is a good
approximation to some transformation $\psi(T^*)$ rather than to
$T^*$, the double integral in (6.3) can be substantial.
Efron (1990) replaced the right side of (6.2) with
$T_b^* + \psi^{-1}(\hat L_b) - \psi^{-1}(L_b^*)$ for the $t$-statistic,
with $\psi^{-1}$ estimated using cubic regression of $T^*$ against
$L^*$. We suggest replacing the right side of (6.2) with
$\psi^{-1}(\psi(T_b^*) + \hat L_b - L_b^*)$ instead, which is
invariant under transformations of $T^*$ (up to limitations of
the procedure used to estimate $\psi$).
[Figure 3 panels: "Heteroskedastic Normal Case" and "Single Chi-squared Case"; horizontal axis in both panels: correlation between $T^*$ and $L^*$.]

Figure 3: Efficiency for Tail Estimation as a function of the correlation between $L^*$ and $T^*$.


We show three variations of the concomitants procedure
in Table 1. The first variation uses (6.2), the second
uses Efron's procedure with cubic regression, and
the third uses the invariant procedure with $\psi$ estimated
using the deterministic procedure in section 5. In all
cases we use the saddlepoint estimate of the inverse
cumulative distribution function of $L^*$ (Hesterberg 1994)
to determine $\hat L_b$. The transformations do result in higher
efficiency, with the invariant procedure performing best,
but the biggest improvement is obtained by using
tail-specific linear approximations rather than central linear
approximations. The efficiencies using tail-specific
approximations are three to four times higher than those
obtained using central approximations.

6.2  Importance  sampling 

Importance sampling uses bootstrap samples of size $n$
generated from a distribution $g$ rather than by simple
random sampling $f$, and places weight
$W_b = f(X_b^*) / (B\, g(X_b^*))$ on $T_b^*$ to counteract the
sampling bias, resulting in the distribution function estimate

$$\hat G(a) = \begin{cases}
  \sum_{b=1}^{B} W_b\, I(T_b^* \le a) & \text{left tail} \\
  1 - \sum_{b=1}^{B} W_b\, I(T_b^* > a) & \text{right tail}
\end{cases} \qquad (6.4)$$

The multi-part definition is necessary because the
weights do not add to 1, and $\hat G$ is inaccurate in the
center; see Hesterberg (1988, 1991) for further discussion
and methods suitable for problems other than tail
estimation.

We consider sampling distributions $g$ of the form

$$g_\lambda(X^*) = \lambda_0 f(X^*) + \lambda_1 g_1(X^*) + \lambda_2 g_2(X^*)$$

where the $\lambda$'s are nonnegative mixing proportions that
add to 1 and $g_k$ indicates sampling with unequal
probabilities $P\{X_i^* = x_j\} = c_k \exp(r_k L_j)$ for $k = 1, 2$
respectively, where the $c_k$ are normalizing constants. This is a
combination of exponential tilting (Johns 1988, Davison
1988) with defensive mixture distributions (Hesterberg
1988, 1991), and we stratify the mixing proportions by
drawing exactly $B_0 = \lambda_0 B$ bootstrap samples using simple
random sampling and $B_k = \lambda_k B$ samples using $g_k$. The
weight

$$W_b = \frac{B^{-1}}{\lambda_0 + \lambda_1 c_1^n \exp(n r_1 L_b^*) + \lambda_2 c_2^n \exp(n r_2 L_b^*)}$$

is independent of which distribution was used to generate
sample $b$. The variance of the distribution estimate is

$$\mathrm{Var}(\hat G(a)) = \frac{1}{B} \sum_{k=0}^{2} \lambda_k\,
  \mathrm{Var}_{g_k}\bigl(B W\, I(T^* \text{ exceeds } a)\bigr),$$

where "exceeds" means "$\le$" and "$>$" for left and right
tails, respectively.


T.C. Hesterberg
                          Left Tail ($\alpha$ = 0.025)     Right Tail ($\alpha$ = 0.975)
                          Central   Tail-specific     Central   Tail-specific
Concomitants 1              1.5          5.6            2.5          8.6
Concomitants 2              2.0          6.3            2.4          9.0
Concomitants 3              2.1          8.0            2.5         10.8
Importance Sampling 1                                   8.0         16.6
Importance Sampling 2       5.3         12.5            4.9          8.9
Importance Sampling 3       4.4          9.8            4.3          7.2
Post-Stratification         1.5          4.7            1.8          5.6

Table 1: Efficiency using Central and Tail-Specific Linear Approximations

Estimated efficiency for estimating tail probabilities, for $B = 200$, for the $t$-statistic, using the central and
tail-specific influence function linear approximations. The importance sampling distributions parallel those in Figure 3,
but with values of $r$ chosen so that $L$ … using saddlepoint quantiles (Hesterberg 1994). The estimates are
based on 2000 bootstrap experiments. Standard errors are less than 4% of the estimates, except for the tail-specific
post-stratification estimates (less than 7%).


The first sampling distribution in Figure 3 uses simple
exponential tilting ($\lambda_2 = 1$) rather than a mixture,
with $r = 2.18/\mathrm{Var}(L^*)$, which Johns (1988) finds to be
optimal for the 97.5th percentile when $T^* = L^*$ and $L^*$ is
normal. This is a very anti-conservative sampling
distribution, as $W_b$ is practically unbounded, and is not
robust to imperfect linear approximations, particularly
in the heavy-tailed single-$\chi^2$ case.

The second distribution uses $B_1 = B_2 = B/2$ bootstrap
samples computed using exponential tilting for
the two tails, with $-r_1 = r_2 = z_\alpha/\mathrm{Var}(L^*)$. The
third uses $B_0 = B/5$ simple random bootstrap samples,
and $B_1 = B_2 = 0.4B$ samples using the same $r_1$
and $r_2$. These distributions are more conservative (with
$W_b$ bounded above by approximately $6.8/B$ and …
respectively), and do slightly worse when $\rho = 1$, but
do much better for smaller $\rho$. A description of optimal
choice of the $r$'s and $\lambda$'s is beyond the scope of this
article, but we will note that the second and third
distributions are more conservative (larger $\lambda_0$ and smaller
$r$'s) for $\rho > 0.95$ than is optimal if (4.5) holds exactly,
but do offer insurance against the effect of cubic and
higher deviations from linearity, which result in heavier
tails than a random variable.

Two practical considerations argue in favor of the second
or third importance sampling distributions. First, many
bootstrap methods require estimates of quantiles from
both tails; a mixture that incorporates $g_{\text{left}}$ and $g_{\text{right}}$
is a robust and more efficient alternative to performing
separate simulations for each tail. Second, if importance
sampling is combined with linear regression methods for
determining $L$, the $B_0$ bootstrap samples from $f$ may be
used as the training set for the regression.
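
A defensive mixture of tilted resampling distributions, with stratified mixing proportions, might be sketched as follows. This is a minimal illustration under stated assumptions rather than the author's implementation: all function names are hypothetical, the tilted probabilities are normalized to sum to one over the observations (so the density ratio is computed directly rather than through the constants $c_k$), and the weights carry the factor $1/B$.

```python
import math
import random


def tilt_probs(L, r):
    """Resampling probabilities proportional to exp(r * L_j)."""
    m = max(r * l for l in L)  # subtract the max for numerical stability
    w = [math.exp(r * l - m) for l in L]
    s = sum(w)
    return [x / s for x in w]


def mixture_weight(idx, probs_list, lambdas, B):
    """Weight (1/B) * f(sample) / g_mixture(sample) for one bootstrap sample.

    idx: indices of the n resampled observations.
    probs_list[k]: resampling probabilities of mixture component k
        (component 0 is simple random sampling).
    """
    n = len(idx)
    denom = 0.0
    for lam, p in zip(lambdas, probs_list):
        if lam > 0:
            # Density ratio g_k / f = prod_i (n * p[j_i]), via logs.
            log_ratio = sum(math.log(n * p[j]) for j in idx)
            denom += lam * math.exp(log_ratio)
    return 1.0 / (B * denom)


def importance_bootstrap(x, L, stat, r1, r2, lambdas, B, rng):
    """Return (T*_b, W_b) pairs; lambdas = (lambda_0, lambda_1, lambda_2)."""
    n = len(x)
    probs_list = [[1.0 / n] * n, tilt_probs(L, r1), tilt_probs(L, r2)]
    # Stratify the mixing proportions: exactly lambda_k * B samples from g_k.
    counts = [round(lam * B) for lam in lambdas]
    total = sum(counts)
    pairs = []
    for k, Bk in enumerate(counts):
        for _ in range(Bk):
            idx = rng.choices(range(n), weights=probs_list[k], k=n)
            w = mixture_weight(idx, probs_list, lambdas, total)
            pairs.append((stat([x[j] for j in idx]), w))
    return pairs


def left_tail_cdf(pairs, a):
    """Left-tail branch of (6.4): sum of weights with T* <= a."""
    return sum(w for t, w in pairs if t <= a)
```

Because the simple random sampling component always contributes $\lambda_0$ to the denominator, every weight in this sketch is at most $1/(B\lambda_0)$ whenever $\lambda_0 > 0$, which is the defensive property of the mixture.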


The  tail-specific  linear  approximations  result  in  sub¬ 
stantially  better  efficiency  in  Table  1,  roughly  by  a  factor 
of  two. 


6.3  Post-stratification 

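We propose $I(L^* \le L^{*(\alpha)})$ as a variable for
post-stratification, and let

$$\hat G(a) = \sum_{b=1}^{B} W_b\, I(T_b^* \le a)$$

be the empirical distribution formed by placing weight

$$W_b = \begin{cases}
  \alpha / \#(L^* \le L^{*(\alpha)}) & \text{if } L_b^* \le L^{*(\alpha)} \\
  (1 - \alpha) / \#(L^* > L^{*(\alpha)}) & \text{if } L_b^* > L^{*(\alpha)}
\end{cases} \qquad (6.5)$$

on $T_b^*$. The estimate is unbiased with variance

$$\mathrm{Var}(\hat G(a)) =
  \alpha^2\, \frac{h(P(T^* \le a \mid L^* \le L^{*(\alpha)}))}{\#(L^* \le L^{*(\alpha)})}
  + (1 - \alpha)^2\, \frac{h(P(T^* \le a \mid L^* > L^{*(\alpha)}))}{\#(L^* > L^{*(\alpha)})}$$

conditional on $\#(L^* \le L^{*(\alpha)})$, where $h(p) = p(1 - p)$ is
the variance of a Bernoulli random variable with mean
$p$, and is asymptotically normal with asymptotic
standardized variance

$$\alpha\, h(P(T^* \le a \mid L^* \le L^{*(\alpha)}))
  + (1 - \alpha)\, h(P(T^* \le a \mid L^* > L^{*(\alpha)}))$$

as $B \to \infty$. At $a = T^{*(\alpha)}$ this reduces to
$2 p_{1,2} - p_{1,2}^2 / (\alpha (1 - \alpha))$, where
$p_{1,2} = P\{L^* \le L^{*(\alpha)},\ T^* > T^{*(\alpha)}\}$.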
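
The reduction at the $\alpha$ quantile can be checked directly. The following worked step is a sketch that assumes the standardized variance has the form $\alpha\,h(\cdot) + (1-\alpha)\,h(\cdot)$ with $h(p) = p(1-p)$, and writes $p = p_{1,2} = P\{L^* \le L^{*(\alpha)},\ T^* > T^{*(\alpha)}\}$, where $L^{*(\alpha)}$ and $T^{*(\alpha)}$ denote the $\alpha$ quantiles of $L^*$ and $T^*$.

```latex
\begin{align*}
P(T^* \le T^{*(\alpha)} \mid L^* \le L^{*(\alpha)}) &= \frac{\alpha - p}{\alpha} = 1 - \frac{p}{\alpha}, \\
P(T^* \le T^{*(\alpha)} \mid L^* > L^{*(\alpha)}) &= \frac{\alpha - (\alpha - p)}{1 - \alpha} = \frac{p}{1 - \alpha}, \\
\alpha\, h\!\left(1 - \frac{p}{\alpha}\right) + (1 - \alpha)\, h\!\left(\frac{p}{1 - \alpha}\right)
  &= p\left(1 - \frac{p}{\alpha}\right) + p\left(1 - \frac{p}{1 - \alpha}\right) \\
  &= 2p - p^2\left(\frac{1}{\alpha} + \frac{1}{1 - \alpha}\right)
   = 2p - \frac{p^2}{\alpha(1 - \alpha)}.
\end{align*}
```

For comparison, the unstratified empirical distribution has standardized variance $\alpha(1 - \alpha)$ at the same point, so the gain is large whenever $p_{1,2}$ is small relative to $\alpha(1 - \alpha)$.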
Under fairly general conditions, including the smooth
functions of means model considered in Do & Hall (1992),
the errors in linear approximations are such
that $p_{1,2} = \ldots$. Thus at $a = T^{*(\alpha)}$ this
post-stratification estimator shares the same factor of …
in the variance as the concomitants procedure. The
process may be repeated for every level $\alpha$ for which an
estimate is desired.

Post-stratification requires an estimate of $L^{*(\alpha)}$, which
we obtain using the saddlepoint quantile estimate of
Hesterberg (1994), based on the saddlepoint formula of
Lugannani and Rice (Daniels 1987). Davison and Hinkley
(1988) use the saddlepoint for linear bootstrap problems;
post-stratification uses a linear approximation for
nonlinear problems.

Post-stratification does not do as well as either
concomitants or importance sampling in Table 1. On the
other hand, it is substantially simpler than those
procedures. It uses simple random sampling, does not require
an estimate of $\psi$, and requires only one saddlepoint
estimate (of $L^{*(\alpha)}$) for each quantile desired.
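
The simplicity of the procedure shows in a short sketch. The function name and arguments below are hypothetical; `l_alpha` stands for the externally supplied estimate of the $\alpha$ quantile of $L^*$ (obtained in the paper by a saddlepoint method, which is not reproduced here), and the stratum weights $\alpha/\#(\cdot)$ and $(1-\alpha)/\#(\cdot)$ are assumed.

```python
def post_stratified_cdf(t_star, l_star, a, l_alpha, alpha):
    """Post-stratified estimate of P(T* <= a).

    Bootstrap samples are split into two strata by whether
    L* <= l_alpha; the strata receive total weight alpha and
    1 - alpha respectively, spread evenly within each stratum.
    """
    n_low = sum(1 for l in l_star if l <= l_alpha)
    n_high = len(l_star) - n_low
    if n_low == 0 or n_high == 0:
        raise ValueError("both strata must contain at least one sample")
    w_low = alpha / n_low
    w_high = (1.0 - alpha) / n_high
    return sum(
        w_low if l <= l_alpha else w_high
        for t, l in zip(t_star, l_star)
        if t <= a
    )
```

When $T^* = L^*$ and $a$ equals `l_alpha`, the estimate is exactly $\alpha$ regardless of how many samples happen to fall in the lower stratum, which is the source of the variance reduction.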

7  Conclusion 

The central and tail-specific influence function and
positive jackknife linear approximations work well in the
$t$-statistic example and in other problems we investigated
where the bootstrap statistic $T$ can be written as a
smooth function of weights, including the bivariate
correlation coefficient and sample variance, although in the
latter cases the gains from tail-specific approximations
are less than with the $t$-statistic. The choice between
the influence function and positive jackknife reduces to a
question of implementation, whether the analytical
calculations required by the influence function or the
numerical calculations required by the jackknife are easier.

The  butcher  knife  worked  nearly  as  well  in  the  smooth 
function  problems,  and  also  worked  for  the  sample  medi¬ 
an  and  25%  trimmed  mean.  However,  the  butcher  knife 
is  a  radical  proposal.  Where  each  numerical  calcula¬ 
tion  for  the  (central)  positive  jackknife  is  equivalent  to 
repeating  a  single  observation  twice,  the  butcher  knife 
corresponds  to  repeating  the  observation  times;  this 
would  be  too  many  for  statistics  which  are  sensitive  to 
the  number  of  times  observations  are  repeated. 

The  central  regression  approximation  calculated  from 
T*  worked  reasonably  well,  but  the  tail-specific  version 
suffered  from  excessive  sampling  variability.  Both  ver¬ 
sions  were  significantly  improved  by  replacing  T*  with 

If  estimates  for  multiple  values  of  a  in  each  tail  are 
needed  it  should  suffice  to  use  a  single  tail-specific  linear 


approximation  for  all.  Particularly  for  importance  sam¬ 
pling  it  would  be  impractical  to  use  multiple  approxima¬ 
tions  and  multiple  sampling  distributions  for  each  tail. 

Simulation  Details 

Simulations are run in S (version S-PLUS 3.0) (Becker et
al., 1988; Statistical Sciences, 1991) and C, using the
Super Duper random number generator of Marsaglia, using
common random numbers with the original observations
sorted. Antithetic variates and balancing are not used.
Abstract

Perhaps the most popular Markov chain Monte Carlo
method from the class of Hastings-Metropolis algorithms
is the symmetric random walk Metropolis algorithm.
This paper will discuss some of its theoretical
properties. Conditions ensuring geometric convergence
of the algorithm will be given, in terms of smoothness
and exponential decay conditions on the target distribution,
and an example where geometric ergodicity does
not hold is discussed. Finally, recent results on optimal
scaling of proposal kernels as a function of dimension
of the target distribution will be given, and the results
related to overall acceptance rates of the algorithm.

1  Introduction 

It is now well understood that the convergence properties
of the Gibbs sampler (see for example Gelfand and
Smith, 1990) are closely linked to the correlation
structure of functionals of coordinate directions (see for
example Amit, 1991, Hills and Smith 1992). Unfortunately,
the Metropolis algorithm (Metropolis et al., 1953) is
considerably less well understood. In particular, there
is no obvious connection between convergence rates of
Metropolis algorithms and the statistical properties of
target densities. Although known to work well very often
in practice, Metropolis algorithms are not automatic
procedures (a proposal density needs to be chosen a
priori) and the choice of proposal can often be critical to
the efficiency of the algorithm.

Very little progress has been made on the problem
of choosing a proposal, even for the simplest algorithm,
the random walk Metropolis algorithm. A number of
authors have suggested scaling the proposal variance in
proportion to the variance of the target density. In
practice, it is impossible to obtain a reliable a priori estimate
of the target variance, so that Tierney (1991) and Muller
(1993) suggest monitoring the proportion of accepted
Metropolis jumps. This is an appealing approach since
the acceptance rate, $p_{\text{jump}}$, is extremely easy to monitor.
Muller observes that an acceptance rate of around
0.5 often works well, but can any theoretical justification
be given for using such rules?

This  paper  will  discuss  two  sets  of  results.  First  of  all, 
we  consider  the  problem  of  determining  when  the  ran¬ 
dom  walk  Metropolis  algorithm  is  geometrically  ergodic. 
It  turns  out  that  geometric  ergodicity  is  related  to  the 
tail  behaviour  of  the  target  density,  and  to  a  curvature 
condition  on  the  contours  of  the  target  density,  but  that 
the  form  of  the  proposal  density  is  (essentially)  unim¬ 
portant.  Section  2  discusses  these  results;  further  details 
and  proofs  appear  in  Roberts  and  Tweedie  (1994a). 

The second set of results considers a diffusion
approximation for high dimensional Metropolis algorithms with
spherically symmetric proposal densities. The limiting
diffusion process has a speed measure which we can
interpret as the asymptotic efficiency of the algorithm.
This speed measure depends only on the scaling of the
proposal density variance, and this in turn can be
related to $p_{\text{jump}}$. Thus, efficiency can be related directly
to $p_{\text{jump}}$, and it turns out that the optimal value for
$p_{\text{jump}}$ should be somewhere around 0.25. Perhaps more
usefully in practice, an acceptance rate of between 0.15
and 0.4 gives at least 85% of maximal possible efficiency.
These results are covered in Section 3; further details
including proofs and practical implications of these results
appear in Roberts, Gelman and Gilks (1994) and Gelman,
Roberts and Gilks (1994).

2  Geometric convergence of the Random walk Metropolis algorithm

We say that a Markov chain $X$ with state space contained
in $\mathbb{R}^d$ is geometrically ergodic (to $\pi$) in total
variation norm, if $\pi$ is a probability measure and

$$\int |P^t(x, y) - \pi(y)|\, dy \le V(x)\, \rho^t$$

$\forall x \in \mathrm{supp}\,\pi$. Here, $P^t$ denotes the $t$-step transition kernel
for $X$, $V$ is a real-valued function, and $\rho < 1$.

We  argue  here  that  geometric  convergence  is  a  min¬ 
imal,  but  important  requirement  that  should  be  satis¬ 
fied  by  a  Markov  chain  Monte  Carlo  algorithm.  Ideally, 
we  would  like  quantitative  bounds  on  V  and  p.  How¬ 
ever  here  we  content  ourselves  with  qualitative  results 
because 


(1)  Quantitative  bounds  for  relatively  complex  prob¬ 
lems  are  extremely  difficult  to  obtain  (although  see 
Rosenthal,  1994). 

(2)  Non  geometric  algorithms  have  heavy  tailed  excur¬ 
sion,  so  have  a  tendency  to  get  stuck.  This  can  also 
make  the  choice  of  starting  value  highly  critical. 

(3)  Geometric  convergence  results  at  least  allow  the  ex¬ 
istence  of  central  limit  theorems  (see  for  example 
Roberts  and  Tweedie,  1994a),  allowing  some  reas¬ 
surance  for  the  use  of  ergodic  estimates  of  Markov 
Chain  Monte  Carlo  output. 

(4)  Surprisingly  perhaps,  many  of  the  algorithms  com¬ 
monly  used  (including  in  some  cases,  the  random 
walk  Metropolis  algorithm)  fail  to  be  geometrically 
ergodic. 

The following result can be used to demonstrate either
geometric or non-geometric convergence. We do not
state it in its most general form, although we will need
the following definitions. We say that a set $C$ is small if
there exist $\epsilon > 0$, $t \in \mathbb{N}$, and a probability measure $\nu(\cdot)$
such that

$$\inf_{x \in C} P^t(x, A) \ge \epsilon\, \nu(A)$$

for all sets $A$. In our context compact sets are nearly
always small, although it is frequently possible for
unbounded sets to be small also.

Let $\tau_C = \inf\{t \ge 1;\ X_t \in C\}$.


Theorem 1 (Meyn and Tweedie, 1993) The following
three statements are all equivalent.

(1) $X$ is geometrically ergodic.

(2) (Foster drift condition) There exists a small set $C$,
a function $V \ge 1$, $\lambda < 1$, and $b \ge 0$ such that

$$E_x[V(X_1)] \le \lambda V(x) + b\, I[x \in C].$$

(3) There exists a small set $C$ and a constant $\kappa > 1$
such that

$$\sup_{x \in C} E_x[\kappa^{\tau_C}] < \infty.$$

The  second  equivalence  is  most  useful  for  demonstrat¬ 
ing  geometric  convergence  whereas  the  third  is  partic¬ 
ularly  useful  for  establishing  that  an  algorithm  is  not 
geometrically  ergodic.  The  following  result  is  a  simple 
consequence  of  (3). 

Theorem 2 (Roberts and Tweedie, 1994a) A necessary
condition for geometric convergence of a Markov chain
with stationary distribution $\pi$, not concentrated at a
single point, is that

$$\operatorname{ess\,sup} P(x, \{x\}) < 1.$$

Here the essential supremum is taken with respect to the
stationary distribution $\pi$.

We will now describe results for the random walk
Metropolis algorithm which can be derived from Theorems
1 and 2. For simplicity (although this is not necessary
in most of the results that we give), we shall assume
that the random walk step is a spherically symmetric
continuous distribution, and that the target density
$\pi$ from which we wish to sample is a $d$-dimensional
Lebesgue density on $\mathbb{R}^d$. Let $q$ denote the proposal
kernel, and let $\alpha$ be the acceptance probability of any
particular move. The algorithm then proceeds iteratively
as follows. Given $X_{t-1}$, choose $Y$ according to
the density $q(|Y - X_{t-1}|)$. Accept $Y$ and set $X_t = Y$
with probability

$$\alpha(X_{t-1}, Y) = \min\left\{1, \frac{\pi(Y)}{\pi(X_{t-1})}\right\}.$$

Otherwise set $X_t = X_{t-1}$.
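
The iteration just described can be sketched as follows. This is a minimal illustration assuming a spherical normal proposal with scale `sigma` and a target supplied through its log density; the function name `random_walk_metropolis` and its arguments are hypothetical, and the acceptance rule used is the standard one for a symmetric proposal.

```python
import math
import random


def random_walk_metropolis(log_pi, x0, sigma, n_steps, rng):
    """Symmetric random walk Metropolis with N(0, sigma^2 I) increments.

    log_pi: function returning log pi(x) up to an additive constant.
    Returns the chain (a list of states) and the observed acceptance rate.
    """
    x = list(x0)
    lp_x = log_pi(x)
    chain = []
    accepted = 0
    for _ in range(n_steps):
        # Propose Y = X + increment (spherically symmetric).
        y = [xi + sigma * rng.gauss(0.0, 1.0) for xi in x]
        lp_y = log_pi(y)
        # Accept with probability min{1, pi(Y) / pi(X)}.
        if lp_y >= lp_x or rng.random() < math.exp(lp_y - lp_x):
            x, lp_x = y, lp_y
            accepted += 1
        # On rejection the chain stays where it is.
        chain.append(list(x))
    return chain, accepted / n_steps


if __name__ == "__main__":
    def log_std_normal(x):
        return -0.5 * sum(v * v for v in x)

    chain, rate = random_walk_metropolis(
        log_std_normal, [0.0], 2.4, 20000, random.Random(0)
    )
    print("acceptance rate:", rate)
```

Working with the log density avoids underflow in the tails, and only the ratio $\pi(Y)/\pi(X_{t-1})$ is needed, so the normalizing constant of the target never has to be known.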

Roberts and Tweedie (1994a) give very general conditions
for geometric ergodicity of Hastings-Metropolis
algorithms, and in particular the random walk Metropolis
algorithm. We content ourselves with a brief summary
of the main ideas. Regularity and smoothness
conditions are therefore omitted, as well as the most general
statement of the result.

Define

$$C_\epsilon = \{x;\ \pi(x) = \epsilon\}$$

to be the contour of $\pi$ at level $\epsilon$ for (typically) small $\epsilon$. Now define
$\kappa(\epsilon)$ to be the supremum of the Ricci curvature over all
points on $C_\epsilon$. The Ricci curvature is the multidimensional
analogue of curvature, and can be described in
terms of the curvature of the largest hypersphere that
can be inscribed (locally at least) within the interior of
the contour manifold.
Essentially, the random walk Metropolis algorithm is
geometrically ergodic when the following two conditions
are satisfied:

(A) $\lim_{\epsilon \to 0} \kappa(\epsilon) = 0$.

(B) There exist constants $A, B > 0$ such that
$\pi(x) \le A \exp\{-B|x|\}$.

Under these conditions, the test function $V(x) =
\pi(x)^{-1/2}$ can be used in the second equivalence of
Theorem 1.

The exponential decay condition (B) is close to being
necessary for geometric convergence. Intuitively, for
target densities with tails heavier than exponential, there
is too much mass in the tails for a random walk dynamic
to be able to see quickly enough. In one dimension (A) is
not relevant, and Mengersen and Tweedie (1993)
essentially demonstrate necessity and sufficiency of (B)
for geometric convergence of the algorithm.

Condition (A) is far from being a necessary condition,
although some kind of restriction on $\kappa(\epsilon)$ for small $\epsilon$ is
necessary, as the following example demonstrates.

Example 1 Suppose

$$\pi(x, y) \propto \exp\{-x^2 - x^2 y^2 - y^2\},$$

then it is not hard to show that $\kappa(\epsilon) \to \infty$ as $\epsilon \to 0$.
Therefore the density has long ridges along the coordinate
axes. The random walk Metropolis algorithm can
be shown to be not geometrically ergodic by Theorem 2.
(See Roberts and Tweedie, 1994a for further details).

However large classes of densities can be shown to
satisfy (A) and (B).

Example 2 Suppose $\pi(x)$ is positive everywhere, and
satisfies

$$\pi(x) = t(x) \exp\{-r(x)\}$$

where $t$ and $r$ are polynomials and $r$ satisfies the
following "positive definiteness" property. Suppose $r$ is of
degree $m$, then if $r_m$ is the polynomial consisting of all
$r$'s $m$-th order terms, $r_m(x) \to \infty$ as $|x| \to \infty$. Then $\pi$
satisfies (A) and (B), and so the random walk Metropolis
algorithm is geometrically ergodic.

$\pi$ in Example 1 fails to satisfy the positive definiteness
condition since $r_m = x^2 y^2$.

3  Efficiency and scaling of proposals

To fix ideas in this section, we shall assume that the
proposal distribution is normal, so that

$$q(x, y) = \frac{1}{(2\pi\sigma^2)^{d/2}}
  \exp\left\{-\frac{(y - x)^T (y - x)}{2\sigma^2}\right\}.$$

The question we address here is: how should we choose
$\sigma$ to make the algorithm as efficient as possible?

Unfortunately this question is ill-posed: there is no
unique measure of efficiency for such an algorithm. A
discussion of different measures of efficiency appears in
Gelman, Roberts and Gilks (1994). Instead, we use an
asymptotic argument as $d$ gets large. Consider first the
case where the $d$-dimensional target density $\pi_d$ has the
product form:

$$\pi_d(x) = \prod_{i=1}^{d} f(x_i) \qquad (1)$$

For the $d$-dimensional problem, choose $\sigma = \phi/\sqrt{d}$. It
turns out that this is the right way of scaling the
variance. Now define

$$Y_t^d = X^d_{[td],1}$$

to be a speeded up version of the first component of $X$.
Now $Y^d$ is making smaller and smaller jumps, more and
more often, so that if a sensible limit process exists as
$d \to \infty$, then we would expect it to be a continuous
process. Although $Y^d$ is not Markov for any $d$, the limiting
process turns out to be a Langevin diffusion, and is hence
Markov. The limiting process satisfies the stochastic
differential equation

$$dY_t = \frac{h(\phi)}{2} \frac{f'(Y_t)}{f(Y_t)}\, dt + h(\phi)^{1/2}\, dB_t \qquad (2)$$

where

$$h(\phi) = 2 \phi^2\, \Phi\!\left(-\frac{\phi F^{1/2}}{2}\right), \qquad (3)$$

with $\Phi$ the standard normal distribution function, and

$$F = \int \frac{f'(x)^2}{f(x)}\, dx \qquad (4)$$

is a Fisher's information measure for $f$ ($F = 1$ for
standard normal $f$). The limiting value of $p_{\text{jump}}$ for this
sequence of problems is $2\Phi(-\phi F^{1/2}/2)$.

The speed of the diffusion $h(\phi)$ is maximized by the
choice

$$\phi = \hat\phi = 2.38 / F^{1/2}.$$

Therefore the asymptotically optimal jumping kernel has
variance-covariance matrix $(\hat\phi^2/d) I_d$, with jumping
probability approximately 0.234.
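
The constants 2.38 and 0.234 can be reproduced numerically. The sketch below assumes the speed function $h(\phi) = 2\phi^2\,\Phi(-\phi F^{1/2}/2)$ and limiting acceptance rate $2\Phi(-\phi F^{1/2}/2)$, with $\Phi$ the standard normal distribution function; the function names and the golden-section search are illustrative choices, not part of the original paper.

```python
import math


def norm_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def speed(phi, fisher=1.0):
    """Assumed diffusion speed h(phi) = 2 phi^2 Phi(-phi sqrt(F) / 2)."""
    return 2.0 * phi * phi * norm_cdf(-phi * math.sqrt(fisher) / 2.0)


def acceptance_rate(phi, fisher=1.0):
    """Assumed limiting acceptance rate 2 Phi(-phi sqrt(F) / 2)."""
    return 2.0 * norm_cdf(-phi * math.sqrt(fisher) / 2.0)


def optimal_phi(fisher=1.0, tol=1e-10):
    """Maximize the speed by golden-section search.

    Assumes h is unimodal on the search interval, which is
    [0.01, 10] rescaled by 1 / sqrt(F).
    """
    scale = 1.0 / math.sqrt(fisher)
    a, b = 0.01 * scale, 10.0 * scale
    g = (math.sqrt(5.0) - 1.0) / 2.0
    c, d = b - g * (b - a), a + g * (b - a)
    while b - a > tol:
        if speed(c, fisher) > speed(d, fisher):
            b = d  # the maximum lies in [a, d]
        else:
            a = c  # the maximum lies in [c, b]
        c, d = b - g * (b - a), a + g * (b - a)
    return 0.5 * (a + b)


if __name__ == "__main__":
    phi_hat = optimal_phi()
    print(phi_hat, acceptance_rate(phi_hat))
```

Because the acceptance rate at the optimum depends on $\phi$ and $F$ only through the product $\phi F^{1/2}$, the optimal acceptance rate is the same for every $f$, which is what makes it usable as a tuning target.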
This is the simplest of such results. Many more general
forms of target density are subject to similar results.
In particular, it is not necessary for $\pi$ to have the
product form of (1). Moreover, the asymptotically optimal
acceptance rate of 0.234 remains robust to many
generalizations. These extensions and the proof of the above
result appear in Roberts, Gelman and Gilks (1994).

4  Summary  and  Conclusions 

The  random  walk  Metropolis  algorithm  is  often  thought 
of  as  a  default  option:  it  is  easy  to  implement,  and  it 
requires  and  uses  no  information  about  the  structure  of 
the  target  density  being  sampled.  As  such,  for  specific 
problems,  there  is  frequently  a  more  efficient  algorithm 
available.  However  in  return,  the  random  walk  Metropo¬ 
lis  algorithm  has  relatively  robust  theoretical  properties, 
as  discussed  in  Section  2.  In  contrast,  more  ‘tailor-made’ 
algorithms  such  as  those  derived  from  Langevin  diffusion 
approximations  can  have  highly  undesirable  properties 
(see  Roberts  and  Tweedie,  1994b). 

The  efficiency  results  of  Section  3  suggest  that  the 
algorithm  should  perform  best  with  overall  acceptance 
rates  in  the  range  [0.15,  0.4].  However  a  number  of  words 
of  caution  are  in  order. 

There  are  many  target  densities  for  which  the  random 
walk  Metropolis  algorithm  is  inappropriate,  for  instance 
highly  multi-modal  or  heavy  tailed  distributions.  The  re¬ 
sults  of  Section  3  only  provide  an  efficiency  measure  rela¬ 
tive  to  other  random  walk  Metropolis  algorithms.  There 
is  no  guarantee  that  there  exists  a  proposal  scaling  that 
gives  an  absolutely  efficient  algorithm. 

The result is asymptotic, and although it is supported
by simulation studies for relatively well-behaved unimodal
densities (see Gelman, Roberts and Gilks, 1994),
its performance on low-dimensional multimodal problems
is unlikely to be satisfactory.

In practice, one might try to "fine-tune" the algorithm
as the simulation proceeds in order to obtain actual
acceptance rates within the range suggested. Care has to
be taken with such a procedure since the stationarity of
the target density can be compromised by such a
non-Markov procedure (for example see Gelfand and Sahu,
1993). Moreover, observed acceptance rates may be
misleading in an inefficient algorithm. Therefore monitoring
acceptance rates should never be used as a diagnostic for
efficiency.
Abstract. We present a general method for proving
rigorous, a priori bounds on the number of iterations
required to achieve convergence of Markov
chain Monte Carlo. We describe bounds for specific
models of the Gibbs sampler, which have been
obtained from the general method. We discuss
possibilities for obtaining bounds more generally.

1.  Introduction. 

Markov chain Monte Carlo techniques, including
the Metropolis-Hastings algorithm (Metropolis
et al., 1953; Hastings, 1970), data augmentation
(Tanner  and  Wong,  1986),  and  the  Gibbs  sampler 
(Geman  and  Geman,  1984;  Gelfand  and  Smith,  1990) 
have  become  very  popular  in  recent  years  as  a  way 
of  generating  a  sample  from  complicated  probabil¬ 
ity  distributions  (such  as  posterior  distributions  in 
Bayesian  inference  problems).  A  fundamental  is¬ 
sue  regarding  such  techniques  is  their  convergence 
properties,  specifically  whether  or  not  the  algorithm 
will  converge  to  the  correct  distribution,  and  if  so 
how  quickly.  Many  general  convergence  results  (e.g. 
Tierney,  1994),  qualitative  convergence-rate  results 
(Schervish  and  Carlin,  1992;  Liu,  Wong,  and  Kong, 
1991a,  1991b;  Baxter  and  Rosenthal,  1994),  and 
convergence  diagnostics  (e.g.  Roberts,  1992;  Gel- 
man  and  Rubin,  1992;  Mykland,  Tierney,  and  Yu, 
1992)  have  been  developed.  However,  none  of  these 
approaches  are  entirely  satisfactory  (Matthews,  1991; 
Cowles  and  Carlin,  1994). 

In  a  different  direction,  a  number  of  papers  have 
attempted  to  prove  rigorous,  quantitative  bounds  on 
convergence rates for these algorithms (Jerrum and
Sinclair, 1989; Frieze, Kannan, and Polson, 1993;
Meyn and Tweedie, 1993; Lund and Tweedie, 1993;
Mengersen and Tweedie, 1993; Rosenthal, 1991, 1993a,
1993b, 1994). Such results often provide bounds
which  are  very  weak,  and/or  for  very  specific  mod¬ 


els,  but  the  area  appears  to  be  worthy  of  further 
work. 

In  this  paper  we  describe  a  general  method  (Sec¬ 
tion  2)  for  proving  such  quantitative  bounds.  The 
method  requires  only  that  we  verify  a  drift  condi¬ 
tion  and  a  minorization  condition,  for  the  Markov 
chain  of  interest.  We  describe  (Section  3)  the  appli¬ 
cation  of  this  (and  related)  methods  to  various  spe¬ 
cific  examples  of  the  Gibbs  sampler,  including  vari¬ 
ance  components  models,  hierarchical  Poisson  mod¬ 
els,  and  a  model  related  to  James-Stein  estimators. 
In  some  cases,  the  bounds  appear  to  be  small  enough 
to  be  of  practical  use.  In  other  cases,  they  provide 
additional  theoretical  information  about  the  Gibbs 
sampler  for  the  model  being  studied. 

We  close  (Section  4)  with  a  brief  discussion  of 
possibilities  for  further  bounds  of  this  type. 

2.  The  general  method. 

The  simplest  form  of  our  general  method  is  the 
following,  taken  from  Rosenthal  (1993b,  Theorem 
12). 

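Proposition. Let $P(x, \cdot)$ be the transition probabilities
for a Markov chain $X_0, X_1, X_2, \ldots$ on a state
space $\mathcal{X}$, with stationary distribution $\pi(\cdot)$. Suppose
there exist $\epsilon > 0$, $0 < \lambda < 1$, $0 < \Lambda < \infty$,
$d > 2\Lambda/(1 - \lambda)$, a non-negative function
$f : \mathcal{X} \to \mathbb{R}$, and a probability
measure $Q(\cdot)$ on $\mathcal{X}$, such that

$$E(f(X_1) \mid X_0 = x) \le \lambda f(x) + \Lambda, \qquad x \in \mathcal{X} \qquad (1)$$

and

$$P(x, \cdot) \ge \epsilon\, Q(\cdot), \qquad x \in f_d \qquad (2)$$

where $f_d = \{x \in \mathcal{X} \mid f(x) \le d\}$, and where
$P(x, \cdot) \ge \epsilon\, Q(\cdot)$ means $P(x, S) \ge \epsilon\, Q(S)$ for every measurable
$S \subseteq \mathcal{X}$. Then for any $0 < r < 1$, the total variation
distance to the stationary distribution after $k$
iterations is bounded above by

$$(1 - \epsilon)^{rk} + \left(\alpha^{-(1-r)} \gamma^{r}\right)^{k}
  \left(1 + \frac{\Lambda}{1 - \lambda} + E(f(X_0))\right),$$

where

$$\alpha^{-1} = \frac{1 + 2\Lambda + \lambda d}{1 + d} < 1; \qquad
  \gamma = 1 + 2(\lambda d + \Lambda).$$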

Inequality (1) above is called a drift condition,
while inequality (2) above is called a minorization
condition. The proposition thus allows for precise,
quantitative, exponentially-decreasing upper bounds
on the distance to stationarity, as a function of the
number of iterations $k$, using just these two
inequalities.

The proof of this proposition involves the coupling inequality, which states that the total variation distance between the laws of two random variables is bounded by the probability that they are unequal. Proving the proposition thus amounts to (theoretically) constructing auxiliary random variables $Y_k$, so that $\mathcal{L}(Y_k) = \pi$ but $P(X_k = Y_k)$ is as large as possible. Inequality (2) allows us to construct $X_k$ and $Y_k$ jointly so that, whenever $(X_k, Y_k) \in f_d \times f_d$, they have probability $\epsilon$ of becoming equal on the next iteration. Furthermore, inequality (1) implies that the number of iterations $k$ for which $(X_k, Y_k) \in f_d \times f_d$ will be large with high probability. Combining these two facts, we can construct $X_k$ and $Y_k$ so that $P(X_k \ne Y_k)$ is small, and thus use the coupling inequality to prove the proposition. The reader is referred to Rosenthal (1993b) for details.
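
As a rough illustration of how a bound of this kind might be evaluated once the drift and minorization constants are in hand, the following Python sketch computes $(1-\epsilon)^{rk} + (\alpha^{-(1-r)}\gamma^{r})^{k}(1 + \Lambda/(1-\lambda) + E f(X_0))$, with $\alpha^{-1} = (1 + 2\Lambda + \lambda d)/(1 + d)$ and $\gamma = 1 + 2(\lambda d + \Lambda)$. The function name, the argument names, and the constants in the usage note are illustrative assumptions and are not taken from Rosenthal (1993b). The check $d > 2\Lambda/(1-\lambda)$ is included because it is exactly what makes $\alpha^{-1} < 1$.

```python
def drift_minorization_bound(eps, lam, Lam, d, r, k, ef0):
    """Evaluate the total variation bound after k iterations.

    eps      : minorization constant in inequality (2)
    lam, Lam : drift constants (lambda and Lambda) in inequality (1)
    d        : level defining the small set f_d
    r        : free tuning parameter in (0, 1)
    ef0      : E f(X_0), the expected drift function at the initial state
    All names are hypothetical; only the formula follows the proposition.
    """
    if not (0.0 < eps <= 1.0 and 0.0 < lam < 1.0 and Lam > 0.0):
        raise ValueError("need 0 < eps <= 1, 0 < lam < 1, Lam > 0")
    if not 0.0 < r < 1.0:
        raise ValueError("need 0 < r < 1")
    if d <= 2.0 * Lam / (1.0 - lam):
        raise ValueError("need d > 2*Lam/(1 - lam) so that alpha^{-1} < 1")

    alpha_inv = (1.0 + 2.0 * Lam + lam * d) / (1.0 + d)
    gamma = 1.0 + 2.0 * (lam * d + Lam)

    coupling_term = (1.0 - eps) ** (r * k)
    drift_term = (alpha_inv ** (1.0 - r) * gamma ** r) ** k
    return coupling_term + drift_term * (1.0 + Lam / (1.0 - lam) + ef0)
```

With the hypothetical constants `eps=0.1, lam=0.5, Lam=1.0, d=10.0, r=0.05`, both geometric factors are below one, so the returned value decreases as `k` grows. In practice `r` would be tuned to balance the two terms.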

3.  Applications  to  specific  models. 

The general method of Section 2 (and related methods) has been applied to a number of specific examples of the Gibbs sampler, to derive information about their rates of convergence to the appropriate posterior distributions.

In Rosenthal (1993), a version of the data augmentation algorithm (a special case of the Gibbs sampler) was applied to finite sample spaces. It was shown that, with $n$ parameters and $n$ observed data, the algorithm would converge in $O(\log n)$ iterations. Thus, the running time of the algorithm does not grow too quickly with the number of parameters.

In Rosenthal (1991), the Gibbs sampler applied to variance components models (as discussed in Gelfand and Smith, 1990; Gelfand et al., 1990) was analyzed. It was shown that, with $K$ location parameters each having $J$ observed data, the $(K + 3)$-dimensional Gibbs sampler would approximately converge in $O(1 + \log K / \log J)$ iterations. So again, the running time of the algorithm does not grow too quickly with the number of parameters.

In Rosenthal (1993b), the Gibbs sampler applied to a hierarchical Poisson model was analyzed, using the same data as analyzed in Gelfand and Smith (1990). For this data, the (11-dimensional) Gibbs sampler was shown to have total variation distance to stationarity after $k$ iterations bounded above by

$$(0.976)^k + (0.951)^k \left(6.2 + E\left((S^{(0)} - 6.5)^2\right)\right),$$

where $S^{(0)} = \sum_i \theta_i^{(0)}$ is a sum of initial values. The bound is thus explicit and quantitative, and depends explicitly on the initial distribution. The bound is also not absurdly large: for example, if $E((S^{(0)} - 6.5)^2) = 2$ and $k = 150$, this bound is equal to 0.03, implying that 150 iterations are sufficient to achieve randomness.
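
The arithmetic behind the quoted figure can be checked directly. The short Python sketch below evaluates the stated bound $(0.976)^k + (0.951)^k(6.2 + E((S^{(0)} - 6.5)^2))$ and searches for the smallest $k$ that meets a given tolerance. The function names and the tolerance argument are illustrative assumptions; only the two rates and the constants 6.2 and 6.5 come from the text.

```python
def poisson_gibbs_bound(k, second_moment):
    """Bound on total variation distance after k iterations.

    second_moment stands for E((S0 - 6.5)^2), which depends on the
    initial distribution chosen by the user.
    """
    return 0.976 ** k + 0.951 ** k * (6.2 + second_moment)


def iterations_needed(tol, second_moment, k_max=100000):
    """Smallest k whose bound is at most tol (hypothetical helper)."""
    for k in range(k_max + 1):
        if poisson_gibbs_bound(k, second_moment) <= tol:
            return k
    raise ValueError("tolerance not reached within k_max iterations")
```

Calling `poisson_gibbs_bound(150, 2.0)` gives a value of roughly 0.03, in line with the figure quoted above. The first term dominates at that point, so the slower rate 0.976 is what governs how many iterations are needed.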

In Rosenthal (1994), the Gibbs sampler applied to a model related to James-Stein estimators (James and Stein, 1961) was analyzed. The model (suggested by Jun Liu) was designed to avoid the use of guesses and empirical estimates in the usual (empirical Bayes) formulation of James-Stein estimators. The Gibbs sampler was intended to facilitate computations related to the associated posterior distribution. A formula was provided which gave a bound on convergence of the Gibbs sampler explicitly, in terms of the number of iterations, the initial distributions, the prior distributions of the model, and the observed data. When applied to the baseball data analyzed in Efron and Morris (1975) and Morris (1983), it proved that the Gibbs sampler would converge in fewer than 200 iterations.

For certain other prior distributions, it was shown (Rosenthal, 1994) that this Gibbs sampler would in fact not converge at all. This information was used, together with standard convergence theory, to prove that for these (improper) priors, the model itself was improper, i.e. the posterior distribution was non-normalizable. Analysis of the Gibbs sampler was thus used to provide additional information about the model under consideration.

Our method has thus been applied to a variety of realistic examples of the Gibbs sampler. It has provided useful quantitative bounds, convergence information relating the running time to the number of parameters, and additional theoretical information about the underlying model itself.

4.  Discussion. 

It is now widely recognized that convergence issues are crucial for the successful implementation of Markov chain Monte Carlo algorithms. However, no method is entirely satisfactory for demonstrating such convergence or providing a convergence rate.

We have provided a general method (Section 2) for rigorously and explicitly bounding the convergence of these Markov chain algorithms. The method requires only that we verify a drift condition and a minorization condition for the Markov chain under consideration. In principle the method can be applied to virtually any Markov chain algorithm, and does not require special structure such as spectral information or reversibility. However, it must be admitted that, in complicated high-dimensional problems, even the verification of the two required conditions can be quite difficult.

We have described the application of this method to several models of the Gibbs sampler. These models are realistic and non-trivial, and our method provides useful information about their convergence properties. The theoretical results appear to be at the point where they can begin to have practical implications.

However, each of these applications has required additional, extensive computation. Furthermore, similar computation may be extremely difficult for more complicated models. Hence, further work is required before these methods are easily usable in very general applied settings. It is possible that the theoretical approach described here can be combined with a more practical analysis, for example by attempting to verify drift and minorization conditions through additional simulation (Cowles and Rosenthal, 1994), which might allow for wider use.

In any case, while there is much work to be done, the methods described here appear to hold promise for providing rigorous rates of convergence for many additional examples of Markov chain Monte Carlo.

1 Introduction

The Gibbs sampler and other MCMC methods (Gelfand and Smith 1990, Smith and Roberts 1993, Tanner and Wong 1987), which have recently become popular in statistical analysis with complicated models, are no more than devices for generating random samples from an analytically intractable target distribution. The basic idea underlying all these methods is to construct a Markov chain with the target distribution as its equilibrium distribution. The methods differ only in their choice of Markov transition function. For example, the transition function for the Gibbs sampler with systematic scan can be expressed as a product of a sequence of conditional distributions (Smith and Roberts 1993, Liu, Wong and Kong 1994b), while the transition function for a Metropolis-Hastings algorithm consists of a "proposed" transition and a "thinning down" device (Metropolis et al. 1953, Hastings 1970, Smith and Roberts 1993). Much theoretical work has emerged on the convergence properties of MCMC methods. See, for example, Geman and Geman (1984), Gelman and Rubin (1992), Geyer (1992), Liu et al. (1994a,b), Liu (1992, 1994), Mykland, Tierney and Yu (1993), Roberts (1992), Roberts and Polson (1994), Rosenthal (1993a,b), Schervish and Carlin (1993), and Tierney (1991), to name just a few. Here, by looking at the convergence problem from a slightly different angle, we investigate relationships among various concepts in describing a Gibbs sampler and the associated Bayesian missing data problem: the rate of convergence, sample autocorrelations, and the fraction of missing information.

We distinguish two different situations for the Gibbs sampler: Data Augmentation, which refers to a Gibbs sampler with only two iterative components (see Tanner and Wong 1987 for its original version, and Liu et al. 1994a for a structural study), and the general Gibbs sampler (Gelfand and Smith 1990). A reason for doing this is that the two-component case provides us with some extra structure that a general Gibbs sampler does not possess, and the analysis of this simple case can suggest some useful methods for dealing with more general ones.

By making use of covariance structures of Data Augmentation established in Liu et al. (1994a,b), we find that the convergence rate of the induced Markov chain can be characterized by the maximal fraction of missing information, which is closely related to the work of Meng and Rubin (1992) for the EM algorithms. Conversely, because of this characterization, we can use autocorrelations of a stationary Gibbs sampling sequence to estimate the fraction of missing information of any quantity of interest, which is useful for deciding how many multiple imputations should be provided.

This article is arranged as follows. We review the concept of fraction of missing information in Section 2. In Section 3, we present structures and several connections for Data Augmentation. A generalization to the general Gibbs sampler is contained in Section 4. A graphical method for comparing different schemes, using the relationships found in Sections 3 and 4, is described in Section 5. In Section 6, we analyze an example for match-making in "broken regression" (DeGroot, Feder, and Goel 1971).

2  The  Fraction  of  Missing 
Information 

The concept of fraction of missing information was first introduced together with the so-called missing information principle by Orchard and Woodbury (1972). It later proved to be an important concept for studying the EM algorithms (Dempster, Laird and Rubin 1977). Specifically, Louis (1982) presented a method for finding the observed information, and Meng (1991) and Meng and Rubin (1993) systematically explored the concept and used it to characterize the rate of convergence for the EM and the ECM algorithms.

To introduce the fraction of missing information conveniently, we let $\theta$ denote the parameter vector in our model, let $Y$ denote the observed part of an imaginary complete data set, and let $Z$ denote the missing part. A simple identity underlying the missing information principle and the EM algorithms is

$$\log p(\theta \mid Y) = \log p(\theta \mid Y, Z) - \log p(Z \mid \theta, Y) + \log p(Z \mid Y),$$

which implies

$$-\frac{\partial^2 \log p(\theta \mid Y)}{\partial \theta^2} = -\frac{\partial^2 \log p(\theta \mid Y, Z)}{\partial \theta^2} + \frac{\partial^2 \log p(Z \mid \theta, Y)}{\partial \theta^2}.$$

Integrating out the missing data $Z$ with respect to $p(Z \mid \theta, Y)$, we arrive at the following missing information principle:

Observed Information = Complete Information - Missing Information.

Denoting each term by $I_{obs}$, $I_{com}$, and $I_{mis}$, respectively, we can define the fraction of missing information as

$$\gamma = \frac{I_{mis}(\theta)}{I_{com}(\theta)} = 1 - \frac{I_{obs}(\theta)}{I_{com}(\theta)},$$

where the $I$ functions are evaluated at the true parameter value. When $\theta$ is a one-dimensional parameter, the above quantity is well defined. Otherwise, the above definition takes a matrix form. Meng (1991) used the largest eigenvalue of the missing fraction matrix $I_{mis}(\theta) I_{com}^{-1}(\theta)$ to characterize the convergence rate of the EM algorithm.

Now let us take a Bayesian viewpoint. Suppose a prior distribution $p_0(\theta)$ is given, and we are interested in $h = h(\theta)$ (one can view this as a way of eliminating nuisance parameters). If one can impute the missing data, i.e., draw samples $Z^{(1)}, \ldots, Z^{(m)}$ from the predictive distribution $p(Z \mid Y)$, then the posterior distribution of $h$, $p(h \mid Y)$, can be approximated by

$$p(h \mid Y) \approx \frac{1}{m}\left\{p(h \mid Y, Z^{(1)}) + \cdots + p(h \mid Y, Z^{(m)})\right\}.$$

For example, $Z^{(1)}, \ldots, Z^{(m)}$ can be draws from an iterative sampling scheme. When using the above multiple imputation type of approximation, the fraction of missing information is usually important for understanding the impact of the missing data on the estimation of $h$. It is also important for deciding how many imputations should be provided. As Rubin (1987) advocated, $m$ can be chosen as small as 3 to 5 for estimating the posterior mean of $h$. Of course, in this case, the fraction of missing information with respect to $h$ cannot be too high.

The fraction of missing information in the Bayesian framework can be easily defined as (Rubin 1987)

$$\gamma_B = \frac{\mathrm{var}\{E(h \mid Y, Z) \mid Y\}}{\mathrm{var}(h \mid Y)} = 1 - \frac{E\{\mathrm{var}(h \mid Y, Z) \mid Y\}}{\mathrm{var}(h \mid Y)},$$

which can be explained as the extra variation caused by missing $Z$.

Note that in large samples and when $h = \theta$, since $\mathrm{var}(h \mid Y) \approx 1/I_{obs}$ and $E\{\mathrm{var}(h \mid Y, Z)\} \approx 1/I_{com}$, the two definitions of the fraction of missing information, $\gamma$ and $\gamma_B$, are equivalent.

3  Structures  for  Data 
Augmentation 

We call a special situation of the Gibbs sampler Data Augmentation if there are only two components for iterative sampling (Liu et al. 1994). We use $\theta$ and $Z$ to denote the respective components in Data Augmentation to emphasize its connection with Bayesian missing data problems.

Let $\theta^{(1)}, Z^{(1)}, \theta^{(2)}, Z^{(2)}, \ldots$ be consecutive draws from a stationary Data Augmentation. In other words, we assume that $\theta^{(1)}$ is drawn from the target distribution $p(\theta \mid Y)$. In the following, since everything is conditioned on $Y$, we will omit it in all expressions. For example, when we write $E\{h(\theta) \mid Z\}$, it actually means $E\{h(\theta) \mid Y, Z\}$.

Considering two consecutive draws from Data Augmentation, and writing $h^{(k)}$ for $h(\theta^{(k)})$, we find that

$$\begin{aligned} E\{h^{(k)} h^{(k+1)}\} &= E[E\{h^{(k)} h^{(k+1)} \mid Z^{(k)}\}] \\ &= E[E\{h^{(k)} \mid Z^{(k)}\}\, E\{h^{(k+1)} \mid Z^{(k)}\}] \\ &= E\{E^2(h \mid Z)\}, \end{aligned} \tag{1}$$

where the first equality follows from the elementary fact that $E(A) = E[E(A \mid B)]$; the second and third equalities follow from the fact that $\theta^{(k)}$ and $\theta^{(k+1)}$ are conditionally independent and identically distributed given $Z^{(k)}$. These facts can be illustrated by the following diagram:

θ(1)        θ(2)        θ(3)   ...
     \     /     \     /     \
      Z(1)        Z(2)         ...

From the diagram, we observe that $\theta^{(k)}$ connects with $\theta^{(k+1)}$ through $Z^{(k)}$ and, from the definition of the scheme, $(\theta^{(k)}, Z^{(k)})$ and $(\theta^{(k+1)}, Z^{(k)})$ have the same joint distribution when the chain is stationary. These two properties only hold for Data Augmentation, not for the general Gibbs sampler. However, this type of dependence graph can be applied to a general Gibbs sampler and provide useful intuition. In Section 5 we will illustrate how to use these diagrams to compare different schemes.

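As a consequence of (1), we have the following identity:

$$\mathrm{cov}\{h(\theta^{(k)}), h(\theta^{(k+1)})\} = \mathrm{var}[E\{h(\theta) \mid Z\}].$$

The formula implies that the correlation coefficient between two consecutive $h$'s is

$$\mathrm{corr}\{h(\theta^{(k)}), h(\theta^{(k+1)})\} = \frac{\mathrm{var}[E\{h(\theta) \mid Z\}]}{\mathrm{var}\{h(\theta)\}} = \gamma_B(h).$$

The intuition is that the higher the fraction of missing information, the more "sticky" the sample outputs from Data Augmentation, and vice versa. The extra variance caused by the missing data, $\mathrm{var}\{E(h \mid Z)\}$, can then be estimated as

$$\hat{v}_{mis} = \frac{1}{m-1} \sum_{k=1}^{m-1} (h^{(k)} - \bar{h}_m)(h^{(k+1)} - \bar{h}_m),$$

where $\bar{h}_m$ is the sample mean of $h^{(1)}, \ldots, h^{(m)}$.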

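If, on the other hand, $g(Z) = E(h \mid Z)$ is easy to compute, one may also approximate $\mathrm{var}\{E(h \mid Z)\}$ by

$$\frac{1}{m} \sum_{i=1}^{m} (g^{(i)} - \bar{g}_m)^2,$$

where $g^{(i)} = E\{h \mid Z^{(i)}\}$ and $\bar{g}_m = (g^{(1)} + \cdots + g^{(m)})/m$. This is a variation of Rao-Blackwellization (Gelfand and Smith 1990, Liu et al. 1994a).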


Intuitively, it seems that the latter estimate is better. For example,

$$\mathrm{var}\{h^{(1)} h^{(2)}\} = E\{(h^{(1)} h^{(2)})^2\} - [E\{E^2(h \mid Z)\}]^2 = E\{E^2(h^2 \mid Z)\} - [E\{E^2(h \mid Z)\}]^2,$$

while

$$\mathrm{var}(g^2) = E\{E^4(h \mid Z)\} - [E\{E^2(h \mid Z)\}]^2.$$

Hence, by the Cauchy-Schwarz inequality, we have

$$\mathrm{var}(g^2) \le \mathrm{var}\{h^{(1)} h^{(2)}\}.$$

Furthermore, by Theorem 3.1 of Liu et al. (1994),

$$\mathrm{cov}\{(g^{(1)})^2, (g^{(k+1)})^2\} = \mathrm{var}\{E(\cdots E[E\{g^2(Z) \mid \theta\} \mid Z] \cdots)\},$$

where the right hand side has $k$ expectation signs. Also, we notice that

$$E\{g^2(Z) \mid \theta\} = E\{E[g(Z) h(\theta) \mid Z] \mid \theta\}.$$

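For the estimate based on products of consecutive $h$'s, $\hat{v}_{mis}$, we let $f(\theta) = E[E\{h(\theta) \mid Z\} \mid \theta]$, which is just $E(h^{(2)} \mid \theta^{(1)})$. Then we have

$$\mathrm{cov}(h^{(1)} h^{(2)}, h^{(k+1)} h^{(k+2)}) = \mathrm{cov}(h^{(2)} f^{(2)}, h^{(k+1)} h^{(k+2)}),$$

which, for the same reason as above, has the following expression:

$$\mathrm{var}\{E(\cdots E[E\{h(\theta) f(\theta) \mid Z\} \mid \theta] \cdots)\},$$

where there are $k - 1$ expectation signs on the right hand side. However,

$$E\{h(\theta) f(\theta) \mid Z\} = E\{E[h(\theta) g(Z) \mid \theta] \mid Z\}.$$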

If we compare the expression of the lag-$k$ autocovariance for the $(g^{(k)})^2$ sequence with that for the $h^{(k)} h^{(k+1)}$ sequence, we find that the former always has one more conditional expectation sign than the latter. However, since the orders of the conditionings are different, there is no clear comparison between the two except when the lag is 1, in which case the autocovariance for the latter expression is always greater than or equal to the former.

The following analogy is helpful for understanding the above discussion. Consider two scenarios: (i) a vector $a$ is projected to vector $b$ and then to vector $c$; (ii) $a$ is directly projected to $c$. How do we compare the lengths of the projections? Apparently, if the three vectors are in the same plane and $b$ lies between $a$ and $c$, the latter projection is smaller than the former one. But in most other cases, the former is smaller than the latter. This corresponds to comparing $\mathrm{var}[E\{E(X \mid Y) \mid Z\}]$ and $\mathrm{var}\{E(X \mid Z)\}$.

For any two random variables $U$ and $V$, we define the maximal correlation between them as

$$R(U, V) = \sup_{\mathrm{var}\{t(U)\} = \mathrm{var}\{s(V)\} = 1} \mathrm{corr}\{t(U), s(V)\}.$$

It is well understood that for a reversible stationary Markov chain $X^{(1)}, X^{(2)}, \ldots$, the maximal correlation between two consecutive states, $R(X^{(k)}, X^{(k+1)})$, is equal to $\lambda$, where $1 - \lambda$ is the so-called "spectral gap." See Liu et al. (1994a,b) for more references. In the discrete case, $\lambda$ is just the magnitude of the second largest eigenvalue (in absolute value). For a nonreversible chain, the scaled long-range maximal correlation is equal to $\lambda$ (Liu et al. 1994b). That is,

$$\lim_{k \to \infty} \{R(X^{(1)}, X^{(k+1)})\}^{1/k} = \lambda.$$

It is shown in Liu et al. (1994a) that the maximal correlation between two consecutive draws of Data Augmentation, $R(\theta^{(k)}, \theta^{(k+1)})$, is the intrinsic rate of convergence of the scheme, and is equal to $R^2(\theta, Z)$.

On the other hand, under mild conditions (see Csaki and Fischer 1960), there exists a pair of functions $h_0(\theta)$ and $g_0(Z)$ with unit variance such that $\mathrm{corr}(h_0, g_0) = R(\theta, Z)$ (denoted as $R$ later), and

$$E\{g_0(Z) \mid \theta\} = R\, h_0(\theta) \tag{2}$$

$$E\{h_0(\theta) \mid Z\} = R\, g_0(Z) \tag{3}$$

Therefore, $h_0$ suffers the maximal fraction of missing information

$$\gamma_B(h_0) = \mathrm{var}\{E(h_0 \mid Z)\}/\mathrm{var}(h_0) = R^2,$$

and the maximal fraction of missing information is equal to the rate of convergence of Data Augmentation. If a function $h$ is correlated with $h_0$ (with respect to $\pi$), then

$$[\mathrm{corr}\{h(\theta^{(1)}), h(\theta^{(k+1)})\}]^{1/k} \to R^2$$

as $k$ goes to infinity. This follows from the spectral decomposition of $h$ (Liu 1991, Garren and Smith 1994, Roberts 1992). It suggests that the maximal fraction of missing information can be estimated from the output sequence of the Gibbs sampler.
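
The link between lag-1 autocorrelation and the fraction of missing information can be seen in a toy case. The Python sketch below is an illustration under assumptions that are not part of the paper: $(\theta, Z)$ is taken to be standard bivariate normal with correlation $\rho$, so that $E(\theta \mid Z) = \rho Z$ and the fraction of missing information for $h(\theta) = \theta$ is $\rho^2$. The function names are hypothetical.

```python
import random


def simulate_da(rho, m, seed=0):
    """Two-component Gibbs sampler (Data Augmentation) for a toy target.

    Assumed target: (theta, Z) standard bivariate normal with correlation
    rho, so theta | Z ~ N(rho * Z, 1 - rho^2) and symmetrically for Z.
    Returns m consecutive theta draws from the stationary chain.
    """
    rng = random.Random(seed)
    sd = (1.0 - rho * rho) ** 0.5
    theta = rng.gauss(0.0, 1.0)  # start from the stationary marginal
    draws = []
    for _ in range(m):
        z = rng.gauss(rho * theta, sd)   # draw Z given theta
        theta = rng.gauss(rho * z, sd)   # draw theta given Z
        draws.append(theta)
    return draws


def lag_autocorr(x, k):
    """Sample lag-k autocorrelation of the sequence x."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    cov = sum((x[i] - mean) * (x[i + k] - mean) for i in range(n - k)) / (n - k)
    return cov / var
```

With `rho = 0.8`, the lag-1 autocorrelation of a long run should be close to `0.64`, which is $\rho^2$. This matches the identity that the lag-1 autocorrelation of $h$ equals $\mathrm{var}\{E(h \mid Z)\}/\mathrm{var}(h)$ in the two-component case.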


4  Missing  Information  in  the 
General  Gibbs  Sampler 

We now turn our attention to the general Gibbs sampler with systematic scan. There are two situations commonly encountered in practice. We shall discuss them in order of increasing complexity.

Case 1. $\theta = (\theta_1, \theta_2)$, $Z = Z$. That is, given $\theta$, $Z$ can be drawn directly; but $\theta_1$ must be drawn conditional on both $\theta_2$ and $Z$, and $\theta_2$ must be drawn conditional on $\theta_1$ and $Z$. Note that this generalizes in the obvious way. The following diagram illustrates the sampler:


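Hence,

$$\mathrm{cov}\{h(\theta_1^{(k)}), h(\theta_1^{(k+1)})\} = \mathrm{var}[E\{h(\theta_1) \mid \theta_2, Z\}],$$

which implies that the lag-1 autocorrelation of the $h$ sequence is in general not its fraction of missing information with respect to $Z$, but is a quantity that reflects dependency between $\theta_1$ and $(\theta_2, Z)$. Note that

$$\mathrm{var}[E\{h(\theta_1) \mid \theta_2, Z\}] \ge \mathrm{var}[E\{h(\theta_1) \mid Z\}].$$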

Another approach is to design a function $g(Z)$ and to estimate the maximal correlation between $\theta$ and $Z$ from it. For example, if it happens that we know $g_0$ in (2) and (3), then by Lemma 4 of Liu (1994),

$$\mathrm{cov}\{g_0(Z^{(k)}), g_0(Z^{(k+1)})\} = \mathrm{var}[E\{g_0(Z) \mid \theta\}] = R^2\, \mathrm{var}\{h_0(\theta)\}.$$

Here $R^2$ is the maximal fraction of missing information and is an upper bound for $\gamma_B(h)$. This duality provides us with the following scheme for obtaining an estimate of the maximal fraction of missing information.

Step 1. Design a function $g(Z)$. Usually this can just be a linear function (e.g., see Liu 1991).

Step 2. Estimate the lag-$k$ autocorrelation $r_k$ for the $g$ sequence for $k = 1, 2, \ldots$ after the chain converges, and fit the exponential model

$$r_k = c \rho^k.$$

Garren and Smith (1994) provided refined methods. The fitted value $\rho$ is an estimate of $R^2(\theta, Z)$.
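
One simple way to carry out the fit in Step 2 is sketched below in Python. It is a stand-in technique, not the refined method of Garren and Smith (1994): the exponential model is linearized as $\log r_k = \log c + k \log \rho$ and fitted by ordinary least squares. The function name is hypothetical, and non-positive sample autocorrelations are simply skipped because their logarithm is undefined.

```python
import math


def fit_exponential_decay(r):
    """Fit r_k = c * rho**k by least squares on the log scale.

    r is a list whose entry at index k-1 is the lag-k autocorrelation,
    for k = 1, 2, ..., len(r). Returns the pair (c, rho).
    """
    pairs = [(k, math.log(rk)) for k, rk in enumerate(r, start=1) if rk > 0]
    if len(pairs) < 2:
        raise ValueError("need at least two positive autocorrelations")

    n = len(pairs)
    k_bar = sum(k for k, _ in pairs) / n
    l_bar = sum(l for _, l in pairs) / n
    sxx = sum((k - k_bar) ** 2 for k, _ in pairs)
    sxy = sum((k - k_bar) * (l - l_bar) for k, l in pairs)

    slope = sxy / sxx                  # estimate of log(rho)
    intercept = l_bar - slope * k_bar  # estimate of log(c)
    return math.exp(intercept), math.exp(slope)
```

Fitting on the log scale gives the small, noisy autocorrelations at long lags as much weight as the large ones at short lags, so in practice the fit would usually be restricted to a modest range of lags.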

Case 2. $\theta = (\theta_1, \theta_2)$ and $Z = (z_1, z_2)$. This is the case where the fraction of missing information cannot be estimated from the sample autocorrelations. The maximal fraction of missing information can, however, be extracted from long-range autocorrelations, for the same reason as explained in Case 1.

5  Compare  Schemes  via 
Diagrams 

In running a Gibbs sampler or a more general MCMC algorithm, one usually has flexibility in designing sampling schemes. As with many iterative methods, we are usually faced with a dilemma: we either have to sacrifice computational ease for iterative simulation in exchange for fast convergence, or have to suffer slow convergence in exchange for computational simplicity. Only in some rare situations, as explored in Liu (1994), can we be satisfied in both ways. Specifically, when the Bayesian predictive distribution is simple, one can use the predictive updated version to improve convergence without sacrificing computational simplicity. Liu et al. (1994a) and Liu (1994) provided some theoretical arguments based on operator theory. Here we use diagrams to illustrate autocorrelation structures. We hope that the analysis in this section can shed light on more complicated general situations.

For the sake of simple argument, suppose the sampler involves three components $(\theta_1, \theta_2, Z)$ and each component is visited in turn: $\theta_1 \to \theta_2 \to Z$. The following diagram shows the dependency between two consecutive iterations. For example, $\theta_1^{(2)}$ is generated by a draw from $\pi(\theta_1 \mid \theta_2^{(1)}, Z^{(1)})$, which is illustrated in the diagram by two arrows connecting $\theta_2^{(1)}$ and $Z^{(1)}$ with $\theta_1^{(2)}$. Other arrows have similar implications. This diagram shows that the two consecutive states depend on each other via the connection between $(\theta_2^{(1)}, Z^{(1)})$ and $(\theta_1^{(2)}, \theta_2^{(2)})$, as illustrated by three arrows in the middle of the diagram.


The next diagram illustrates a grouping scheme, where it is assumed that given $Z$, $(\theta_1, \theta_2)$ can be drawn together. The diagram illustrates that dependency between two consecutive states is via the connection between $Z^{(1)}$ and $(\theta_1^{(2)}, \theta_2^{(2)})$, where only two arrows are used for this connection. Compared with the above diagram for the original sampler, dependency between the two consecutive states for grouping is weaker.


Our final diagram represents the collapsing scheme, in which we assume that $\theta_2$ can be theoretically integrated out so that the sampler is applied only to the two remaining components. In this diagram, the only connection between two consecutive states is that between $Z^{(1)}$ and $\theta_1^{(2)}$. Only one arrow is used, which indicates the weakest correlation among the three schemes.


We expect that this type of analysis can be generalized to other situations to help one design efficient sampling schemes.

6 An Example: Broken
Regression

Suppose $x_i$, $i = 1, \ldots, 100$, are i.i.d. normal with variance $\tau^2$; and $y_i = \alpha + \beta x_i + \epsilon_i$, where the $\epsilon_i$ are i.i.d. from $N(0, \sigma^2)$. It is a standard regression problem if we observe $(x_i, y_i)$ for $i = 1, \ldots, 100$. Suppose, however, the pairing information is somehow lost and we can only observe $u_i$, $i = 1, \ldots, 100$, a random shuffle of the $y_i$. The problem is no longer trivial. This can also be viewed as a special case of the file matching problem. DeGroot et al. (1971) studied this problem with the objective of maximizing the number of correct matches. We are interested in estimating $\beta$ and the corresponding fraction of missing information (from not knowing the matching).

Let $Q$ be the permutation that produces the $u_i$ from the $y_i$. The main difficulty is that $Q$ is missing. Let $\theta = (\alpha, \beta)$ and $U = (u_1, \ldots, u_{100})$. With a prior distribution on $\theta$, Data Augmentation can be applied if we can (a) draw $Q$ from $p(Q \mid \theta, U)$ and (b) draw $\theta$ from $p(\theta \mid Q, U)$. Step (b) is simple since it only involves a multivariate $t$-distribution. Step (a) is nontrivial. As was implemented in a preliminary report of Y. Wu (Dept. of Statist., Harvard U.), step (a) can be accommodated by a "Metropolized shuffling" scheme. Roughly speaking, a random shuffling scheme is employed that provides us with a Markov chain on the space of all permutations. Based on this chain, we can apply the Metropolis-Hastings rejection rule to achieve our target distribution $p(Q \mid \theta, U)$. In our simulation, we used switch shuffling (randomly draw two cards and switch them). Within each iteration (i.e., a cycle of Steps (a) and (b)), 500 Metropolized shuffles were conducted, since, as theory suggests, $O(n \log n)$ steps are needed to shuffle $n$ cards uniformly.
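
A minimal Python sketch of the switch-shuffle step (a) follows. It is not Y. Wu's implementation. It assumes a uniform prior over permutations, so that $p(Q \mid \theta, U) \propto \exp\{-\sum_i (u_{Q(i)} - \alpha - \beta x_i)^2 / (2\sigma^2)\}$, and it represents $Q$ as a list `perm` in which `u[perm[i]]` is the response currently matched to `x[i]`. The function name and argument names are hypothetical.

```python
import math
import random


def metropolized_switch_shuffle(x, u, perm, alpha, beta, sigma, n_steps, rng):
    """Run n_steps Metropolis updates on the matching permutation.

    Each step proposes switching the responses matched to two randomly
    chosen positions. Because the proposal is symmetric, the acceptance
    probability is min(1, target ratio).
    """
    perm = list(perm)  # work on a copy; the caller's list is left untouched
    n = len(perm)
    for _ in range(n_steps):
        i, j = rng.sample(range(n), 2)
        # squared residuals at positions i and j before the switch
        old = ((u[perm[i]] - alpha - beta * x[i]) ** 2
               + (u[perm[j]] - alpha - beta * x[j]) ** 2)
        # squared residuals at positions i and j after the switch
        new = ((u[perm[j]] - alpha - beta * x[i]) ** 2
               + (u[perm[i]] - alpha - beta * x[j]) ** 2)
        log_ratio = -(new - old) / (2.0 * sigma ** 2)
        if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
            perm[i], perm[j] = perm[j], perm[i]
    return perm
```

Only the two affected residuals are recomputed at each step, so one proposal costs constant time. In a full Data Augmentation cycle this routine would be followed by a draw of $\theta$ given the current matching.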

We simulated a data set with $\sigma^2 = 1$, $\tau^2 = 1$, and $\alpha = 0$. Assuming that $\alpha = 0$ is known, we used a flat prior for $\beta$. Figure 1 illustrates our results. Panel (1,1) shows the posterior distribution of $\beta$, where the $x$'s were simulated from $N(0, 1)$ and the true $\beta$ was zero. As indicated, its variance is 0.12, considerably larger than 0.01, the complete-data posterior variance of $\beta$. Panel (1,2) shows the autocorrelations among the $\beta$'s. The fraction of missing information can be estimated as $\gamma_B = 0.924$ from the autocorrelation plot. As the theory in Sections 2 and 3 indicates,

$$(1 - \gamma_B)\, \mathrm{var}(\beta \mid U) = E\{\mathrm{var}(\beta \mid U, Q)\},$$

where the right-hand side is the average complete-data variance. This identity was experimentally confirmed since $(1 - 0.923) \times 0.12 = 0.009$, which is close to the theoretical value 0.01. Panel (2,1) is the same posterior distribution, but the $x$'s were simulated from $N(0, 1)$ and the true $\beta = 0$. With the $x$'s far from the origin, both the posterior variance, 0.021, and the fraction of missing information, 0.619, were considerably smaller. In Panel (3,1), the $x$'s were simulated from $N(1, 1)$ and the true $\beta = 1$. It seems to suggest that the fraction of missing information is not related to the true value of $\beta$, but is very sensitive to $\sum x_i^2$.

An intuitive solution to the problem is to sort both the $x$ and the $u$ first and then do a regression on the sorted data. But this procedure overestimates $\beta$ and does not provide proper inference. The Bayesian method we employed, however, is unbiased (with flat prior) and supplies proper variance estimation. When $\sum x_i^2$ is extremely large, the sorting method (essentially any method) works well, implying that the matching information is unimportant for the inference of $\beta$. This, together with the foregoing simulation study, suggests a conjecture that the fraction of missing information for $\beta$ decreases monotonically as $\sum x_i^2$ increases.
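
The upward bias of the sorting approach is easy to see in a small experiment. The Python sketch below assumes $\alpha = 0$ is known, so the regression is taken through the origin, and uses $\sigma^2 = 1$ with $x_i \sim N(0, 1)$. The function names, the seed, and the choice of a through-the-origin fit are illustrative assumptions rather than details from the paper.

```python
import random


def sorted_regression_slope(x, u):
    """Slope of a through-the-origin regression of sorted u on sorted x."""
    xs, us = sorted(x), sorted(u)
    return sum(a * b for a, b in zip(xs, us)) / sum(a * a for a in xs)


def demo(beta=0.0, n=100, seed=1):
    """Simulate broken-regression data and return the sorted-data slope."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [beta * xi + rng.gauss(0.0, 1.0) for xi in x]
    u = y[:]          # u is a random shuffle of y: the pairing with x is lost
    rng.shuffle(u)
    return sorted_regression_slope(x, u)
```

With a true $\beta$ of zero, sorting pairs the smallest $x$ with the smallest $u$ and so on, which manufactures a strong positive association. The fitted slope is then far from zero even though $x$ and $y$ are independent.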

Abstract

The ability to sample latent variables using Markov chain Monte Carlo (MCMC) has had a major impact on computations relating to the genetic analysis of complex traits, or traits observed on complex pedigrees. One area in which exact likelihood computation is often infeasible is multilocus linkage mapping. One method of linkage analysis for rare recessive traits is homozygosity mapping, where data on affected inbred individuals are analysed. Key to this method are the patterns of autozygosity in the individuals, and MCMC also provides a method for studying these patterns. Algorithms for the exact computation of autozygosity probabilities on an arbitrary pedigree very rapidly become computationally infeasible. However, an MCMC algorithm can provide accurate estimates in reasonable computing time, and these probabilities can then be used to map the genes responsible for disease.

1.  Introduction 

Monte  Carlo  likelihood  is  becoming  increasingly  used 
where exact likelihood analysis is computationally infeasible.
One area in which such likelihoods arise is that
of  genetic  mapping,  where  the  locations  in  the  genome 
of  genes  influencing  a  given  trait  are  to  be  inferred. 

The  elements  of  genetic  models  are  straightforward: 
genes  exist,  genes  segregate  (are  copied)  from  parents  to 
offspring,  and  the  types  of  genes  carried  by  an  individual 
influence  observable  trait  characteristics.  A  locus  is  a 
specification  of  the  position  of  a  gene  on  a  chromosome. 
With  modern  molecular  genetic  techniques,  individuals 
can  be  typed  for  a  wide  variety  of  DNA  markers  of 
known  location  in  the  genome.  These  DNA  markers 
can  be  chosen  to  be  highly  polymorphic;  there  are  many 
different  alleles  (types  of  genes)  that  an  individual  may 
have.  The  genes  at  these  DNA  marker  loci  segregate  in 
a  Mendelian  way  (Mendel,  1866);  each  individual  has 
two genes at the locus, one a copy of a randomly chosen
one  of  the  two  in  his  father,  and  the  other  a  copy  of  a 
randomly  chosen  one  in  his  mother.  Segregation  of  genes 
from  different  parents  to  a  child,  and  from  a  parent  to 
different  children,  are  independent.  These  simple  50/50 
probabilities  underlie  all  of  genetics,  but  in  considering 
the  joint  segregation  at  several  genetic  loci,  or  the 
pattern  of  single-locus  segregations  on  an  extended 
family,  computations  can  rapidly  become  very  complex, 
principally  because  not  all  the  relevant  information  can 
be  observed. 

Genetic loci $L_1, \ldots, L_k$ that index segments of DNA
on the same chromosome are “linked”; the segregations
of genes at two loci are not independent. If the maternal
gene at locus $L_h$ in a father segregates to a child, it
is more probable that the gene that segregates at an
adjacent locus, $L_j$, is also the father’s maternal gene.
Similarly for the father’s paternal gene, and similarly
also for genes segregating from the mother. This dependence
can be expressed through the “recombination
fraction”, $r_{h,j}$, between the two loci. Specifically, the
probability that genes at loci $L_h$ and $L_j$ segregating
from one parent to the child have different grandparental
origins is $r_{h,j}$. In fact, the value of a recombination
fraction  between  two  loci  depends  on  numerous  factors, 
most  importantly  on  the  sex  of  the  parent.  This  fact 
can  be  incorporated  into  analyses,  but,  for  simplicity,  is 
ignored  in  the  current  paper. 

The  biological  phenomenon  underlying  recombination 
is  a  “crossover”  between  the  two  parental  chromosomes 
in  the  formation  of  the  offspring  chromosome.  There 
will be a recombination between loci $L_h$ and $L_j$ if there
is  an  odd  number  of  crossover  events.  The  genetic 
(map)  distance  between  two  loci  is  the  expected  number 
of  crossovers  between  them,  and  hence  is  additive 
(Haldane,  1919).  However,  the  data  provide  information 
only  on  recombination  frequencies  between  loci  (Fisher, 
1922).  This  pattern  is  related  to  map  distance,  but  also 
depends on the pattern of interference. Interference
is the name given to the biological phenomenon that
a crossover at one point on a chromosome affects the
chance that crossovers occur at other points in the
vicinity. Under an assumption of no interference,
recombination events occur along the chromosome as
a Poisson process of rate 1, when the chromosome is measured
in units of map distance. In practice, interference
exists,  particularly  where  the  loci  are  close  together 
and  recombination  fractions  between  them  are  small. 
However,  the  amount  of  data  required  to  estimate  levels 
and  patterns  of  interference  seldom  exists  in  human 
genetic  studies.  In  genetic  mapping,  the  objective  is 
to  detect  linkage,  to  infer  locus  order,  and  place  loci 
on  a  chromosome  by  estimating  recombination  fractions 
between  them.  For  such  purposes,  interference  can 
safely  be  ignored. 
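
As a worked illustration of the no-interference assumption, the relation between map distance and recombination fraction can be written out explicitly. This is the standard map function usually attributed to Haldane; the symbols $d$ (map distance between two loci) and $C$ (number of crossovers between them) are introduced here for illustration only. If $C$ is Poisson with mean $d$, a recombination is an odd count:

```latex
r(d) \;=\; P(C \text{ odd})
     \;=\; \sum_{m \ge 0} e^{-d}\, \frac{d^{2m+1}}{(2m+1)!}
     \;=\; e^{-d} \sinh d
     \;=\; \tfrac{1}{2}\left(1 - e^{-2d}\right).
```

Thus $r(d) \approx d$ for small $d$, and $r(d)$ approaches $1/2$ as $d$ grows, which is why map distances are additive while recombination fractions are not.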

Now  in  mapping  a  genetic  disease,  marker  types 
will  be  available  for  some  individuals  in  a  pedigree  in 
which  the  disease  is  segregating.  Disease  or  relevant 
quantitative  trait  data  will  be  available  also  for  some 
members  of  the  pedigree.  However,  first,  not  all 
individuals  will  be  observed;  some  will  be  unavailable, 
particularly  ancestors.  Second,  the  genes  underlying  the 
trait  phenotypes  may  not  be  determined;  for  example, 
for  a  recessive  disease,  two  copies  of  the  disease  allele  are 
needed  to  express  the  trait,  but  those  who  do  not  express 
it  may  have  one  copy  of  the  disease  allele,  or  none. 
Third,  even  where  single-locus  marker  genotypes  are 
observable  the  haplotype  information  is  not;  that  is,  it 
is  not  known  which  alleles  are  on  the  same  chromosome, 
having  been  received  from  the  same  parent.  One  set  of 
single-locus  genotypes  (a  specification  of  the  unordered 
pair  of  alleles  at  each  locus)  can  correspond  to  many 
different  multilocus  genotypes  (a  specification  of  the 
alleles  on  each  chromosome,  at  each  of  the  loci).  Thus 
in  computing  a  likelihood,  for  a  given  locus  order  and 
set  of  recombination  fractions,  a  huge  sum  over  all 
the  possible  configurations  of  haplotypes  is  required. 
With  the  increasing  availability  of  DNA  markers  there 
is  an  increasing  potential  for  mapping  traits  with  more 
limited  trait  data  or  more  complex  modes  of  expression. 
However,  more  markers,  and  marker  loci  with  more 
alleles,  and  traits  observable  for  a  more  limited  subset  of 
the  pedigree  members,  all  compound  the  computational 
difficulties,  since  the  number  of  possible  underlying 
configurations  of  genes  on  all  the  relevant  members  of 
the  pedigree  increases  vastly. 

Thus,  with  the  increasing  desire  to  examine  multiple 
markers,  and  markers  with  multiple  alleles,  a  major 
limitation  of  linkage  analysis  has  become  the  practical 
and theoretical bounds on the computational feasibility
of likelihood evaluation. There are many further aspects
of linkage analysis, and many alternative approaches to
localising the genes responsible for a genetic disease. A
much  fuller  description  of  standard  statistical  methods 
in  linkage  analysis  may  be  found  in  the  text  by  Ott 
(1991). 

In  this  paper,  we  consider  one  possible  approach  to 
the  computations  needed  to  map  a  rare  recessive  disease 
from  data  on  affected  inbred  individuals.  We  consider 
only  marker  loci  at  which  the  types  of  the  two  genes 
carried by an observed individual are known, and a
recessive disease for which it is known whether or not
an observed individual carries two copies of the disease
allele. The (multilocus) genotype $G_i$ of individual $i$ is a
specification of the types of the genes on each of a pair
of chromosomes of the individual. The phenotype $Y_i$ of
$i$ is a specification of the observed trait characteristics
determined by the underlying genotypes. We subsume
all the parameters of the genetic model into the parameter
vector $\theta$ and use $P_\theta(\cdot)$ to denote probabilities under
the model. The total set of genotypes on a pedigree is
denoted $G$, and of observed phenotypes $Y$.

2.  Monte  Carlo  likelihood 

Monte Carlo estimates of integrals or expectations
are not new, either in general (Hammersley and Handscomb,
1964) or in genetic linkage analysis (Thompson
et al. 1978). However, Monte Carlo methods have only
become widely used with the explosion in use of Markov
chain Monte Carlo (MCMC) which permits simulation
from distributions known only up to a normalising
constant, and hence simulation from conditional distributions.
The statistical problems involved in fitting
genetic linkage models to trait data, $Y$, on a set of
related individuals may be viewed as latent variable
or “missing data” problems. Were all the underlying
genetic events observable, likelihood computation and
parameter estimation would be trivial, but only trait
data (phenotypes) of some individuals are observed. We
denote the latent variables by $X$.

The  likelihood  is 


$$L(\theta) = P_\theta(Y) = \sum_X P_\theta(Y, X) = \sum_X P_\theta(Y \mid X)\, P_\theta(X) \qquad (1)$$


Although  the  summation  may  be  infeasible,  we  suppose 
that  the  latent  variables,  X,  are  chosen  in  such  a  way 
that  each  term  of  the  expression  is  easily  computed. 

Monte  Carlo  estimators  of  likelihood  ratios  can  be 
based  on 


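$$\frac{L(\theta)}{L(\theta_0)} = \frac{P_\theta(Y)}{P_{\theta_0}(Y)} = E_{\theta_0}\!\left( \frac{P_\theta(Y, X)}{P_{\theta_0}(Y, X)} \,\Big|\, Y \right) \qquad (2)$$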


(Thompson and Guo, 1991), provided simulation from
the appropriate distribution is possible. Suppose $X^{(l)}$,
$l = 1, \ldots, N$, are realisations from $P_{\theta_0}(X \mid Y)$; then a
Monte Carlo estimate of the likelihood ratio (2) is

$$\frac{1}{N} \sum_{l=1}^{N} \frac{P_\theta(Y, X^{(l)})}{P_{\theta_0}(Y, X^{(l)})} \qquad (3)$$

From an importance sampling perspective, the estimator
(3) is efficient; for values of $\theta$ close to $\theta_0$ the sampling
distribution mimics the shape of the integrand $P_\theta(Y, X)$
of (1). Further, equation (3), through simulation at a
given $\theta_0$, provides a likelihood ratio approximant, as
a function of $\theta$, in the sense of Geyer and Thompson
(1992). At least for values of $\theta$ close to $\theta_0$, a single
simulation provides an estimate of the local likelihood
surface.
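
The Monte Carlo likelihood ratio can be sketched on a toy problem. The model below is hypothetical and not from the paper: a latent $X \sim \mathrm{Binomial}(3, \theta)$ and a single observed datum $Y = 1$ with fixed probabilities `PY_GIVEN_X`. The latent space is small enough that draws from $P_{\theta_0}(X \mid Y)$ can be made exactly, so the averaged ratio can be compared with the exact likelihood ratio.

```python
import math
import random

# Hypothetical toy model: X ~ Binomial(3, theta) is latent, and the observed
# datum Y = 1 occurs with probability PY_GIVEN_X[x] when X = x.
PY_GIVEN_X = [0.1, 0.3, 0.6, 0.9]

def joint(theta, x):
    """P_theta(Y = 1, X = x) = P(Y = 1 | X = x) * P_theta(X = x)."""
    prior = math.comb(3, x) * theta**x * (1 - theta)**(3 - x)
    return PY_GIVEN_X[x] * prior

def exact_likelihood(theta):
    """L(theta): the sum of the joint probability over all latent values."""
    return sum(joint(theta, x) for x in range(4))

def mc_likelihood_ratio(theta, theta0, n_samples, rng):
    """Average of P_theta(Y,X)/P_theta0(Y,X) over draws X ~ P_theta0(X | Y)."""
    # Weights proportional to P_theta0(X | Y); no normalising constant needed.
    weights = [joint(theta0, x) for x in range(4)]
    draws = rng.choices(range(4), weights=weights, k=n_samples)
    return sum(joint(theta, x) / joint(theta0, x) for x in draws) / n_samples

rng = random.Random(1)
theta0, theta = 0.5, 0.6
estimate = mc_likelihood_ratio(theta, theta0, 20000, rng)
exact = exact_likelihood(theta) / exact_likelihood(theta0)
print(f"Monte Carlo estimate {estimate:.4f}, exact ratio {exact:.4f}")
```

In a real pedigree problem the exact sum and the exact conditional draws are unavailable, which is where the MCMC sampling of the following paragraphs comes in.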

In  Monte  Carlo  approaches  to  complex  problems  with 
many  latent  variables,  the  key  is  simulation  conditional 
upon  data;  that  is  from 

$$P_{\theta_0}(X \mid Y) = P_{\theta_0}(X, Y) / P_{\theta_0}(Y) \qquad (4)$$

With  well  chosen  latent  variables  X,  the  numerator  of 
this  expression  is  readily  evaluated,  but  the  denominator 
is 

$$L(\theta_0) = P_{\theta_0}(Y) = \sum_X P_{\theta_0}(X, Y)$$

and  this  summation  is  often  infeasible.  The  denomi¬ 
nator  is,  in  fact,  precisely  the  likelihood  whose  exact 
evaluation  is  often  impossible,  necessitating  the  Monte 
Carlo  estimation. 

Metropolis-Hastings algorithms are Markov chain
Monte Carlo methods designed to meet this need, providing
realisations (approximately) from a distribution
known up to a normalising constant (Hastings, 1970).
For each $X$ a “proposal distribution” $q(\cdot, X)$ is defined.
Then, if the process is now at $X$ the next value is
generated as follows:

1. Generate $X^*$ from the proposal distribution $q(\cdot, X)$.

2. Compute the Hastings ratio

$$A = \frac{q(X, X^*)\, P_{\theta_0}(X^* \mid Y)}{q(X^*, X)\, P_{\theta_0}(X \mid Y)} = \frac{q(X, X^*)\, P_{\theta_0}(Y, X^*)}{q(X^*, X)\, P_{\theta_0}(Y, X)}$$

Note that $A$ can be computed without knowledge of
$P_{\theta_0}(Y)$.

3. With probability $A^* = \min(1, A)$ the process moves
to $X^*$ and with probability $(1 - A^*)$ it remains at $X$.

The distribution (4) is an equilibrium distribution of the
Markov chain just defined. Provided $q(\cdot, \cdot)$ is chosen so
that the chain is ergodic, running the chain provides
(after a sufficient number of steps for convergence)
realisations from the distribution (4). The algorithm of
Metropolis et al. (1953) is a special case; if $q(X^*, X) =
q(X, X^*)$ the Hastings ratio reduces to the odds ratio of
the proposal state $X^*$ versus the current state $X$.
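
The three steps above can be sketched on a small discrete target. Everything in this example is an illustrative assumption rather than the paper's sampler: the five-state target `WEIGHTS` (known only up to its normalising constant), the asymmetric ring proposal, and the function names. The proposal is deliberately asymmetric so that the proposal terms in the Hastings ratio do not cancel.

```python
import random

# Hypothetical unnormalised target on states 0..4.
WEIGHTS = [1.0, 2.0, 4.0, 2.0, 1.0]
N_STATES = len(WEIGHTS)
P_UP = 0.7  # probability of proposing the next state on the ring

def q(x_new, x_old):
    """Probability of proposing x_new when the chain is at x_old."""
    if x_new == (x_old + 1) % N_STATES:
        return P_UP
    if x_new == (x_old - 1) % N_STATES:
        return 1.0 - P_UP
    return 0.0

def mh_step(x, rng):
    # Step 1: generate X* from the proposal distribution.
    step = 1 if rng.random() < P_UP else -1
    x_star = (x + step) % N_STATES
    # Step 2: Hastings ratio; only ratios of unnormalised weights are needed.
    a = (q(x, x_star) * WEIGHTS[x_star]) / (q(x_star, x) * WEIGHTS[x])
    # Step 3: move with probability min(1, A), otherwise stay.
    return x_star if rng.random() < min(1.0, a) else x

def run_chain(n_steps, seed=0):
    rng = random.Random(seed)
    x = 0
    counts = [0] * N_STATES
    for _ in range(n_steps):
        x = mh_step(x, rng)
        counts[x] += 1
    return [c / n_steps for c in counts]

freqs = run_chain(200000)
target = [w / sum(WEIGHTS) for w in WEIGHTS]
print([round(f, 3) for f in freqs], target)
```

The visit frequencies should approach the normalised target even though the normalising constant is never used inside `mh_step`.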

In the genetic context, the latent variables $X$ have
normally been taken to be the underlying multilocus
genotypes (the pairs of haplotypes) carried by each individual
in the pedigree. This makes for easy evaluation
of $P_{\theta_0}(X, Y)$ but not for easy sampling of the large space
of possible $X$-values. The space of Lange and Matthysse
(1989) is even larger, including also indicators of the
grandparental origins of genes. Although local updating
methods are very slow, they are convenient for
genetic analysis problems. If large changes in genotypic
configuration are proposed, the Hastings ratio can be
impossible to compute, and constraints in the feasible
genotypic patterns on pedigrees mean that almost all
proposals have zero probability.

There  are  various  approaches  to  improving  sampler 
performance  in  genetic  problems.  Lin  (1993)  made  great 
progress  towards  increasing  the  practicality  of  MCMC 
methods  in  linkage  analysis,  using  Metropolis-coupled 
samplers  (Geyer,  1991),  and  a  form  of  “heating”  in 
the  Metropolis-Hastings  steps  to  improve  mixing  of  the 
chain.  Geyer  and  Thompson  (1994)  used  simulated 
tempering  (Marinari  and  Parisi,  1992)  to  make  sampling 
feasible  on  a  very  large  complex  pedigree  with  many 
constraints.  These  strategies  result  in  a  sampler  that 
can  sample  genotypes  efficiently  on  a  large  pedigree. 
However,  for  several  linked  markers,  the  huge  space  of 
possible  genotypic  configurations  that  then  arises  may 
render the sampler ineffective.

An alternative approach is to consider alternative
latent variables $X$, to produce a smaller space more
easily sampled by MCMC methods. Note that the
requirements on $X$ are only that $P_\theta(Y, X)$ should be
very quickly computable. Now $P_\theta(Y, X)$ is normally
computed as $P_\theta(Y \mid X)\, P_\theta(X)$. Thus any $X$ for which
these two factors can be readily computed will suffice.
For the problem of mapping rare recessive traits from
data on inbred affected individuals, it is possible to
bypass the multilocus genotypes of unobserved individuals,
and use only segregation indicators as the latent
variables.

3. Homozygosity mapping

In  linkage  analysis,  due  to  uncertainties  as  to  whether 
an  unaffected  individual  carries  a  disease  gene,  the 
computational  difficulties  on  extended  pedigrees,  and 
the  costs  of  typing  large  numbers  of  individuals,  there 
have  been  many  approaches  towards  basing  linkage 
analyses on a small number of observed (usually affected)
individuals. The extreme case is homozygosity
mapping in which a rare recessive is mapped using only
marker and trait data on independent inbred affected
individuals.

It was first pointed out by Smith (1953) that individuals
affected with rare recessive diseases provide
information for linkage analysis, even without any
marker or phenotype data on other relatives. For a
recessive disease, affected individuals are homozygous at
the disease locus; that is, they carry two copies of the
same allele. For a rare disease, many affected individuals
are so through being the offspring of consanguineous
marriages, and thus receiving two copies of the disease
gene identical-by-descent or autozygous from a recent
common  ancestor  of  the  two  parents.  In  this  case,  the 
affected  individual  is  likely  to  be  homozygous  also  at 
closely  linked  markers,  and  this  homozygosity  provides 
evidence  for  linkage.  Unrelated  inbred  individuals  will 
be  homozygous  at  independent  segments  of  the  genome, 
but  the  shared  affected  status  of  the  individuals  will 
cause  shared  homozygosity  in  the  neighbourhood  of  the 
disease  locus.  The  scope  of  homozygosity  mapping, 
which  is  simply  linkage  analysis  using  data  only  on 
unrelated  inbred  affected  individuals,  was  extended  by 
Lander  and  Botstein  (1987).  With  a  dense  map  of 
highly  polymorphic  DNA  markers,  a  small  number  of 
affected  individuals  can  provide  substantial  information 
for  mapping  a  recessive  disease  gene. 

Linkage  analysis  is  the  analysis  of  cosegregation  of 
genes  at  different  loci,  from  parents  to  offspring.  If  two 
loci  are  tightly  linked,  there  is  a  high  probability  that  if 
the  individual  receives  a  grandmaternal  [grandpaternal] 
allele  from  his  mother  at  one  locus,  he  will  do  so 
also  at  the  adjacent  one,  and  similarly  for  the  gene 
received  from  his  father.  The  key  underlying  events  that 
determine  the  data  on  the  affected  inbred  individual  are 
the  segregations  that  specify  the  ancestral  genes  that  he 
receives. Let $m$ and $p$ index the maternal and paternal
segregations to some individual. Let $S_{mj} = 0$ if the
maternal allele received by the individual at locus $j$ is
of grandmaternal origin, and $S_{mj} = 1$ otherwise, and let
$S_{pj}$ be similarly defined for the paternal allele. Then, at
any locus $j$,

$$P(S_{mj} = 0) = P(S_{mj} = 1) = P(S_{pj} = 0) = P(S_{pj} = 1) = \tfrac{1}{2}$$

and at two loci $h$ and $j$

$$P(S_{mh} = S_{mj}) = P(S_{ph} = S_{pj}) = 1 - r_{h,j}$$

where $r_{h,j}$ is the recombination fraction between the two
loci.
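
These probabilities give the prior of a whole indicator sequence for one segregation. The sketch below assumes no interference (as the paper does), so that the indicators at successive loci form a first-order Markov chain along the chromosome; the function name `indicator_prior` and the list `r` of adjacent-locus recombination fractions are illustrative choices, not notation from the paper.

```python
from itertools import product

def indicator_prior(s, r):
    """Prior probability of the indicators s[0..k-1] of one segregation.

    r[j] is the recombination fraction between locus j and locus j+1
    (0-based), so len(r) == len(s) - 1. Assumes no interference.
    """
    p = 0.5  # either grandparental origin is equally likely at the first locus
    for j in range(len(s) - 1):
        # A change of origin between adjacent loci is a recombination.
        p *= r[j] if s[j] != s[j + 1] else 1.0 - r[j]
    return p

r = [0.1, 0.1, 0.1]
# One recombination followed by two non-recombinant intervals.
example = indicator_prior((1, 0, 0, 0), r)
# The probabilities of all 2^4 sequences sum to one.
total = sum(indicator_prior(s, r) for s in product((0, 1), repeat=4))
print(example, total)
```

Because segregations are independent of one another, the prior of a full array of indicators is the product of `indicator_prior` over its segregations.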

Then for a given segregation $i$, the recombination
events are determined by segregation indicators $S_{ij}$,
$j = 1, \ldots, k$, where $S_{ij}$ is 0 or 1 as the origin of
the segregating gene at locus $j$ is grandmaternal or
grandpaternal, respectively. That is, we shall take the
indicators $S = \{S_{ij}\}$ as the latent variables $X$ in the
Monte Carlo likelihood framework of section 2. Figure 1
shows the case of the offspring of a first-cousin marriage.
At any locus, the offspring individual may receive genes
autozygous from either of his parents’ common grandparents;
there are eight relevant segregation indicators
that will specify the gene descents.

Figure 1: A first cousin marriage, showing segregation
indicators.


Table 1: Example of segregation array, for the
pedigree of figure 1

Segreg.:  S1  S2  S3  S4  S5  S6  S7  S8
Locus
L1         0   1   1   1   0   1   0   0
L2         0   1   1   0   0   1   0   0
L3         0   1   1   0   0   1   0   1
L4         0   1   0   0   0   1   0   1

Table 1 shows four successive patterns of values for
the eight segregation indicators of figure 1, such as
might arise along a chromosome segment, or at four loci.
In the first pattern, the paternal gene of the offspring
individual derives from his grandmother ($S_1 = 0$), and
is the paternal gene of this grandmother ($S_4 = 1$), and is
in fact the great-grandfather’s maternal gene ($S_5 = 0$).
Likewise the final individual’s maternal gene is this same
maternal gene in his great-grandfather; the individual is
autozygous for this gene. By locus 2, $S_4$ has become 0;
the final individual’s paternal gene is now the paternal
($S_6 = 1$) gene of his great-grandmother. By locus 3,
$S_8$ has become 1; this leaves the genes in the final
individual unchanged, since $S_3 = 1$, so the grandfather’s
maternal gene is not transmitted. However, by locus 4,
$S_3$ becomes 0; now the final individual is autozygous for
the paternal gene in his great-grandmother.
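
The descent-path tracing just described can be sketched in code. Figure 1 is not reproduced here, so the assignment of indicators to segregations is an assumption inferred from the worked example above: $S_1$ and $S_2$ are the paternal and maternal segregations to the final individual, $S_4$ is from the father's mother and $S_3$ from the mother's father (the two siblings), $S_5$ and $S_6$ are from the great-grandfather and great-grandmother to the grandmother, and $S_7$ and $S_8$ are from them to the grandfather. The labels `GGF` and `GGM` and the function names are illustrative.

```python
from itertools import product

def founder_gene(s, side):
    """Trace one gene of the final individual back to a great-grandparent.

    s = (S1, ..., S8); 0 = grandmaternal origin, 1 = grandpaternal origin.
    Returns (founder, which_gene), or None if the gene comes from a
    grandparent outside the common ancestry.
    """
    s1, s2, s3, s4, s5, s6, s7, s8 = s
    if side == "paternal":
        if s1 == 1:          # gene from the father's father: unrelated line
            return None
        # Via the father's mother: her paternal (S4=1) or maternal (S4=0) gene.
        return ("GGF", s5) if s4 == 1 else ("GGM", s6)
    if s2 == 0:              # gene from the mother's mother: unrelated line
        return None
    # Via the mother's father: his paternal (S3=1) or maternal (S3=0) gene.
    return ("GGF", s7) if s3 == 1 else ("GGM", s8)

def is_autozygous(s):
    pat = founder_gene(s, "paternal")
    mat = founder_gene(s, "maternal")
    return pat is not None and pat == mat

TABLE_1 = [
    (0, 1, 1, 1, 0, 1, 0, 0),  # locus 1
    (0, 1, 1, 0, 0, 1, 0, 0),  # locus 2
    (0, 1, 1, 0, 0, 1, 0, 1),  # locus 3
    (0, 1, 0, 0, 0, 1, 0, 1),  # locus 4
]
pattern = ["I" if is_autozygous(row) else "N" for row in TABLE_1]
n_auto = sum(is_autozygous(s) for s in product((0, 1), repeat=8))
print(pattern, n_auto, "of 256 states are autozygous")
```

Under this reading the Table 1 array gives autozygosity at loci 1 and 4 only, matching the text, and 16 of the 256 equally likely single-locus states are autozygous, which is the inbreeding coefficient 1/16 of the offspring of first cousins.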

Consider now Table 1 as illustrating a possible value
of the state $S$ at four loci being used in a genetic analysis.
The prior probabilities of $S$ are straightforward.
However, for implementation of a Metropolis algorithm,
relative values of $P_{\theta_0}(Y, S)$ are required, or $P_{\theta_0}(Y \mid S)$.
The binary indicators, $S = \{S_{ij}\}$, of grandparental
origins of genes in each given offspring individual, at
each locus readily determine the multilocus autozygosity
patterns in the observed individual. This is done
simply by following the descent paths of genes, as in
the example described above; an efficient algorithm is
easily implemented to update these descent paths, and
hence the resulting autozygosity pattern, when an $S_{ij}$
changes. For a single observed individual, the autozygosity
pattern is $k$ binary indicators, specifying whether
or not the $S_{ij}$ result in the individual having two genes
autozygous at locus $j$, $j = 1, \ldots, k$. The probability of
a genotype homozygous for an allele with frequency $q$
is $q^2$ or $q$, as the individual is not/is autozygous at the
locus. The probability of a heterozygous genotype is 0
if the individual is autozygous at the locus, and is $2 q_1 q_2$
otherwise, where $q_1$ and $q_2$ are the two allele frequencies.


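Table 2: Probability ratios of segregation
indicators $S_{ij}$

S_{i,j-1}  S_{i,j+1}  probability ratio*
    1          1      (1 - r_{j-1})(1 - r_j) / (r_{j-1} r_j)
    1          0      (1 - r_{j-1}) r_j / (r_{j-1} (1 - r_j))
    0          1      r_{j-1} (1 - r_j) / ((1 - r_{j-1}) r_j)
    0          0      r_{j-1} r_j / ((1 - r_{j-1})(1 - r_j))

* : $P(S_{ij} = 1 \mid S_{-(ij)}) / P(S_{ij} = 0 \mid S_{-(ij)})$

$r_h$ is the recombination frequency between $L_h$ and $L_{h+1}$.

Note also:

$$P(S_{ij} = 1 \mid S_{-(ij)}) = P(S_{ij} = 1 \mid S_{i,j-1}, S_{i,j+1})$$

$$P(S_{ij} = 0 \mid S_{-(ij)}) = P(S_{ij} = 0 \mid S_{i,j-1}, S_{i,j+1})$$

$S_{-(ij)}$ denotes all elements of $S$ other than $S_{ij}$.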

The space of $S$-values is also easy to sample from.
The simplest algorithm uses Metropolis proposals to
change the grandparental origin of the gene at a random
locus in a random segregation. The probability ratio for
the proposed change in $S$ depends only on the indicators
at adjacent loci for the same segregation (Table 2). For
example suppose the current $S$ were that of Table 1, and
the proposal was to change $S_{4,2}$ from its current value
0 to 1. This would eliminate a recombination between
loci 1 and 2 ($S_{4,1} = 1$) giving probability ratio

$$(1 - r_{1,2}) / r_{1,2},$$

but create one between loci 2 and 3 ($S_{4,3} = 0$) giving
another factor

$$r_{2,3} / (1 - r_{2,3}).$$

This recombination ratio is then weighted by the appropriate
conditional probability of phenotypic observations
$P_{\theta_0}(Y \mid S)$, for current and proposed $S$-values.
This sampler is clearly irreducible: if a given pattern
of autozygosity in the observed individual is compatible
with the data, then so also is any pattern with fewer
loci at which the affected individual is autozygous and
hence homozygous.
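
The prior part of this Metropolis ratio can be sketched as follows. The function `flip_ratio` is a hypothetical helper, written under the no-interference assumption, that returns the ratio of prior probabilities (proposed over current) when one indicator of one segregation is flipped; it uses 0-based locus indices, and `r[j]` is the recombination fraction between loci `j` and `j+1`.

```python
def flip_ratio(s, j, r):
    """Prior ratio P(proposed) / P(current) for flipping indicator s[j].

    s holds the indicators of one segregation along the loci; only the
    adjacent loci j-1 and j+1 enter the ratio, as in Table 2.
    """
    ratio = 1.0
    # (neighbouring locus, index of the interval joining it to locus j)
    for neighbour, interval in ((j - 1, j - 1), (j + 1, j)):
        if 0 <= neighbour < len(s):
            rec = r[interval]
            if s[neighbour] == s[j]:
                ratio *= rec / (1.0 - rec)      # the flip creates a recombination
            else:
                ratio *= (1.0 - rec) / rec      # the flip removes a recombination
    return ratio

# Column S4 of Table 1 is (1, 0, 0, 0); propose changing S_{4,2} from 0 to 1.
s4 = (1, 0, 0, 0)
equal_r = flip_ratio(s4, 1, [0.1, 0.1, 0.1])    # the two factors cancel
unequal_r = flip_ratio(s4, 1, [0.1, 0.2, 0.1])  # (0.9/0.1) * (0.2/0.8)
print(equal_r, unequal_r)
```

In a full sampler this value would be multiplied by the ratio of $P_{\theta_0}(Y \mid S)$ at the proposed and current states before the accept/reject step. An indicator at either end of the chromosome segment has only one neighbour, which the bounds check handles.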

Werner’s  syndrome  (WS)  is  a  very  rare  recessive 
genetic  disease  of  premature  aging.  It  has  recently 
been  mapped  to  chromosome  8  using  outbred  affected 
relatives  (Goto  et  al.,  1992),  and  this  linkage  has  been 
confirmed by analysis of a set of inbred affected individuals
(Schellenberg et al., 1992) in 21 small pedigrees of
Japanese and Caucasian origin. The frequency of the
disease allele is assumed to be 0.004. A Monte Carlo
linkage likelihood analysis of a subset of five of these
pedigrees is given by Thompson (1994); here we use just
two of the pedigrees for purposes of illustration. Two
markers were of significance in the published linkage
reports: D8S87 and ANK. Originally ANK and D8S87
were thought to be flanking markers, but the likely order
is now thought to be (WS, D8S87, ANK). For the
purposes of illustration only, we take the recombination
fractions between WS and D8S87 and between D8S87
and ANK each to be 0.1; this is probably larger than
the true values, but of the correct order of magnitude.
Data  and  information  on  these  markers  were  provided 
by  Dr.  Ellen  Wijsman  (personal  communication). 

4.  Autozygosity  probabilities 

In fact, for a single affected inbred individual, the data
$Y$ at a position $h$ on a chromosome depend on $S(h)$ only
through $Z(h)$, the autozygosity ($I$) or non-autozygosity
($N$) in the inbred affected individual. Over multiple
loci, or along the chromosome continuum, these patterns
of autozygosity are themselves of interest. Although,
for a very rare recessive trait, the posterior probability
of autozygosity at the disease locus is very high, the
probability that all of a set of unrelated affected inbred
individuals are autozygous may be low. Further, the
way in which such posterior probabilities are influenced
by data on linked markers is non-trivial, for the patterns
of autozygosity along a chromosome segment follow no
simple process. Specifically, even in the absence of
interference, the process is not Markov, since it is an
aggregate process and shows the clumping phenomenon
typical of such processes (Aldous, 1989; Blossey, 1993).

Consider first the prior, disregarding data $Y$. The
smallest non-trivial example consists of the offspring of
a half-sib mating (figure 2). This is also the largest
example for which the space of $S$-values can be drawn
readily (figure 3). As one moves along the chromosome,
the process $S(h)$ performs a random walk at rate $n$ on
the vertices of the $n$-dimensional hypercube (Donnelly,
1983). Here, $n = 4$ and, without loss of generality,
the two vertices positioned as shown in figure 3 are
those which result in autozygosity of the inbred offspring
individual: $Z(h) = I$ if $S_1(h) = S_2(h) = 1$ and
$S_3(h) = S_4(h)$. Overall, $P(Z(h) = I) = 2/16 = 0.125$.
When $Z(h) = I$, the next jump of the random walk
will require $Z(h) = N$; when $Z(h) = N$, the next
jump results in $Z(h) = I$ with overall probability
$(2 \times \frac{1}{2} + 4 \times \frac{1}{4})/14 = 1/7$. However, although by
symmetry $Z(h) = I$ is a renewal point of the process,
when the process leaves $Z(h) = I$, the probability that
the next jump will result in a return to $Z(h) = I$ is
3/8. The overall probability, $P(Z(h) = I)$, can easily
be computed on even a complex pedigree; it is simply
the inbreeding coefficient of the individual. For two
loci, at given recombination fraction, the probability
of autozygosity at both loci can be computed by the
algorithm of Thompson (1988), again even on a complex
pedigree. However, due to the non-Markov pattern of
autozygosity along the chromosome, these marginal and
pairwise probabilities do not suffice.

Figure 3: Random walk structure, corresponding to
figure 2.

Table 3: Prior autozygosity probabilities for
cousin marriage.

state Z      True(1)   Markov(2)
N  N  N      0.8825    0.8811
N  N  I      0.0264    0.0277
N  I  N      0.0131    0.0131
N  I  I      0.0155    0.0155
I  N  N      0.0264    0.0277
I  N  I      0.0022    0.0009
I  I  N      0.0155    0.0155
I  I  I      0.0184    0.0183

(1) Results from 10^ MCMC steps and 10^ i.i.d realisations
are almost identical, to $10^{-4}$.

(2) Results from assuming (incorrectly) a first-order
Markov chain for autozygosity at successive loci.
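
The half-sib random walk probabilities quoted above (2/16, 1/7 and 3/8) can be checked by direct enumeration of the 16 vertices of the hypercube. This is a minimal sketch: the autozygosity condition is the one stated in the text ($S_1 = S_2 = 1$ and $S_3 = S_4$), and the helper names are illustrative.

```python
from fractions import Fraction
from itertools import product

def is_autozygous(s):
    s1, s2, s3, s4 = s
    return s1 == 1 and s2 == 1 and s3 == s4

def neighbours(s):
    """Vertices reached by one jump, i.e. by flipping a single indicator."""
    return [s[:i] + (1 - s[i],) + s[i + 1:] for i in range(4)]

def p_jump_to_I(s):
    """Probability that the next jump from s lands on an autozygous vertex."""
    nbrs = neighbours(s)
    return Fraction(sum(is_autozygous(t) for t in nbrs), len(nbrs))

states = list(product((0, 1), repeat=4))
auto = [s for s in states if is_autozygous(s)]
non_auto = [s for s in states if not is_autozygous(s)]

# The stationary distribution of the walk is uniform over the vertices.
prior_I = Fraction(len(auto), len(states))
# Averaged over the 14 non-autozygous vertices.
p_N_to_I = sum(p_jump_to_I(s) for s in non_auto) / len(non_auto)
# Leave an autozygous vertex, then return to autozygosity on the next jump.
p_return = sum(Fraction(1, 4) * p_jump_to_I(t) for t in neighbours(auto[0]))

print(prior_I, p_N_to_I, p_return)
```

The return probability 3/8 exceeds the overall rate 1/7 of jumps from $N$ into $I$, which is the clumping of autozygous states discussed in the text.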

Table  3  shows  the  autozygosity  probabilities  for  three 
loci,  with  recombination  fraction  0.1  between  each  pair 
of  adjacent  loci,  for  the  case  of  a  first-cousin  marriage 
(figure  1).  For  this  small  problem,  exact  results 
could  have  been  obtained,  but  in  fact  these  are  Monte 
Carlo  results,  obtained  both  by  10^  Metropolis  steps 
of  MCMC,  and  also  by  10®  independent  realisations 
from  the  prior.  For  this  problem,  these  two  simulations 
give comparable accuracy (to $\pm 10^{-4}$) in comparable
computing  time  (about  8  hours  on  a  DEC3100).  Also 
shown  are  the  probabilities  that  would  be  given  by  a 
first-order  Markov  process  with  the  same  pairwise  and 
marginal probabilities. First, it can be seen that $Z = I$
is a renewal point; where the central locus has $Z = I$ the
“Markov” results agree with the correct results. Second,
the major effect, in terms of relative error, is in the case
$Z = (I, N, I)$; there is a clumping of states $Z = I$ in the
jump chain. Alternatively viewed, there is an increased
probability of small regions of non-autozygosity (and
hence likely heterozygosity at a highly polymorphic
marker) within regions of autozygosity (and hence homozygosity).
In this example, the sequence $(I, N, I)$ has
probability 2.5 times larger than a “Markov” view would
predict.
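
The “Markov” column of Table 3 can be reproduced from the “True” column, on the assumption that the first-order Markov approximation is $P(z_1, z_2)\, P(z_2, z_3) / P(z_2)$ built from the pairwise and marginal probabilities. The sketch below uses the rounded four-decimal table entries, so agreement is only to about that precision; the dictionary and function names are illustrative.

```python
# "True" prior probabilities from Table 3, keyed by the state at loci 1, 2, 3.
TRUE = {"NNN": 0.8825, "NNI": 0.0264, "NIN": 0.0131, "NII": 0.0155,
        "INN": 0.0264, "INI": 0.0022, "IIN": 0.0155, "III": 0.0184}

def marginal(positions):
    """Marginal distribution of the states at the given locus positions."""
    out = {}
    for z, p in TRUE.items():
        key = "".join(z[i] for i in positions)
        out[key] = out.get(key, 0.0) + p
    return out

p12 = marginal((0, 1))   # pairwise probabilities for loci 1 and 2
p23 = marginal((1, 2))   # pairwise probabilities for loci 2 and 3
p2 = marginal((1,))      # marginal probability at the central locus

# First-order Markov approximation with the same pairwise and marginal values.
markov = {z: p12[z[:2]] * p23[z[1:]] / p2[z[1]] for z in TRUE}

for z in TRUE:
    print(z, f"true {TRUE[z]:.4f}  markov {markov[z]:.4f}")
print("INI ratio true/markov:", round(TRUE["INI"] / markov["INI"], 2))
```

The largest relative discrepancy is for the state $(I, N, I)$, whose tabulated probability is about 2.5 times the Markov value.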

Table 4: Prior autozygosity probabilities for
pedigree of figure 4.

state Z      True(1)   Markov(2)
N  N  N      0.7901    0.7889
N  N  I      0.0478    0.0493
N  I  N      0.0257    0.0251
N  I  I      0.0271    0.0273
I  N  N      0.0478    0.0493
I  N  I      0.0050    0.0031
I  I  N      0.0271    0.0273
I  I  I      0.0295    0.0297

(1) Results from 10^ MCMC steps and 10® i.i.d realisations
are almost identical, to $10^{-4}$.

(2) Results from assuming (incorrectly) a first-order
Markov chain for autozygosity at successive loci.

Another  example  is  given  in  Table  4.  Many  of  the 
pedigrees  in  the  Werner’s  syndrome  data  set  are  first 
cousin  marriages.  The  more  complex  pedigree  (figure 
4)  was  first  ascertained  as  a  first  cousin  marriage,  but 
later  it  was  discovered  that  each  parent  of  the  affected 
proband  was  also  the  offspring  of  a  first  cousin  marriage, 
as  shown.  Although  this  is  a  small  pedigree,  exact 
linkage  likelihood  computations  become  infeasible  with 
the  standard  methods  with  more  than  three  loci,  due  to 
the  pedigree  complexity.  The  final  offspring  individual 
can  be  autozygous  for  a  gene  in  any  of  the  three  original 
founders  marked.  Again,  the  “true”  results  in  Table  4 
are Monte Carlo results (both 10® independent samples
and  10^  MCMC  steps,  agreeing  to  4  decimal  places). 
The  “Markov”  assumption  again  underestimates  most 
severely the probability of $Z = (I, N, I)$. However, note
also that now there is no renewal when $Z = I$; the
lack  of  symmetry  of  the  three  relevant  founder  ancestors 
destroys  this  property,  even  though  numerically  the 
discrepancies  are  small. 

Figure 4: A more complex pedigree.

Generally, for just three loci, only the low-probability
state $Z = (I, N, I)$ shows substantial departure from the
first-order Markov probability values. However, with
data, this state may have high posterior probability.
One  of  the  first  cousin  marriages  (figure  1)  for  the 
Werner’s syndrome (WS) data illustrates this. The
data  consist  of  homozygosity  (affected)  at  the  WS  locus 
(allele  frequency  0.004),  heterozygosity  at  the  marker 
locus  D8S87  (for  two  alleles,  each  frequency  0.5)  and 
homozygosity  at  the  ANK  marker.  The  allele  at  ANK 
has  population  frequency  0.44,  so  homozygosity  is  not 
strong  evidence  of  autozygosity,  but  the  example  will 
serve. 

Table 5: Posterior autozygosity probabilities for cousin marriage.

state Z      MCMC (1)    ratio (2)
N  N  N       0.1001       0.1134
N  N  I       0.0068       0.2576
I  N  N       0.7491      28.3750
I  N  I       0.1440      65.4545

(1)  Results  10^  MCMC  steps,  agree  with  prior  x 
likelihood  to  within  standard  error. 

(2)  Ratio  of  posterior  to  prior  probability  (see  text). 

Table  5  shows  the  posterior  probabilities  of  the  four 
relevant  autozygosity  states;  states  autozygous  at  the 
D8S87  locus  are  eliminated  by  the  data,  and  so  not 
listed.  As  expected,  the  states  with  autozygosity  at  the 
WS locus have much increased probability a posteriori;
the WS disease allele has population frequency only
0.004.  Note  in  particular  that  the  state  with  lowest 
prior  probability  now  has  a  probability  0.1440,  65 
times  higher  than  before.  Of  course,  the  posterior 
probabilities  could  also  be  obtained  by  multiplying  the 
prior  state  probabilities  by  the  likelihoods  (and  this 
was  done  as  a  check).  Prior  state  probabilities  can  be 
efficiently  obtained  by  i.i.d  Monte  Carlo,  but  conditional 
probabilities  can  only  be  sampled  via  MCMC.  However, 
even  in  this  simple  example,  the  standard  error  of 
the  MCMC  estimate  for  the  state  INI  is  smaller,  for 
an equal amount of computing time, due to the 65-fold
factor between prior and posterior. When, as
here,  the  range  of  the  ratios  of  posterior  to  prior  is 
3  orders  of  magnitude,  sampling  from  the  prior,  and 
using  importance  sampling  to  reweight  to  the  posterior, 
is  far  less  efficient  than  sampling  from  the  posterior, 
even  though  the  latter  requires  use  of  MCMC. 
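
The efficiency argument can be illustrated with a rough calculation. The Python sketch below is not part of the original analysis: it recovers the prior probability of the state INI from Table 5 (posterior divided by ratio) and compares the relative standard error of a simple frequency estimate from n independent draws under the prior and under the posterior. Independence of the draws and the sample size `n_draws` are assumptions made for illustration; real MCMC output is autocorrelated, and reweighting prior draws to the posterior adds further variance, so the figures are only indicative.

```python
import math

# Values for the state Z = (I, N, I), taken from Table 5.
posterior_ini = 0.1440
ratio_ini = 65.4545                 # posterior / prior
prior_ini = posterior_ini / ratio_ini


def relative_standard_error(p, n):
    """Relative standard error of the sample frequency of an event with
    probability p, estimated from n independent draws (binomial model)."""
    return math.sqrt((1.0 - p) / (n * p))


n_draws = 10_000                    # hypothetical sample size, illustration only
rse_prior = relative_standard_error(prior_ini, n_draws)
rse_posterior = relative_standard_error(posterior_ini, n_draws)

print(f"prior probability of INI:             {prior_ini:.4f}")
print(f"relative s.e. sampling the prior:     {rse_prior:.3f}")
print(f"relative s.e. sampling the posterior: {rse_posterior:.3f}")
```

Under these assumptions the state INI is visited so rarely by prior sampling that its frequency is estimated far less precisely than when the sampler targets the posterior directly.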

5.  Discussion 

Monte Carlo estimation provides an approach when
exact likelihood and probability computation is infeasible,
particularly in problems of complex dependent
highly structured data, such as arise in genetic analysis.
There  are  many  ways  to  set  up  the  Markov  chain 
Monte  Carlo  likelihood  estimates  via  a  choice  of  latent 
variables.  In  this  paper,  we  have  focussed  on  one 
particular choice - the use of segregation indicators.
This  seems  to  have  promise  in  cases  where  a  very 
few  individuals  are  observed  on  each  of  a  number  of 
possibly  large  pedigrees,  the  individuals  being  observed 
for  a  number  of  DNA  markers.  A  particular  case  is 
homozygosity  mapping,  where  the  key  is  the  posterior 
pattern  of  autozygosity  (gene  identity  by  descent)  in 
affected  inbred  individuals. 

MCMC is used to sample from posterior distributions,
but this does not require a Bayesian analysis.
Realisations  from  the  distribution  of  latent  variables, 
conditional  on  the  data,  but  at  prespecified  parameter 
values,  can  be  used  to  provide  efficient  Monte  Carlo 
estimates  of  a  likelihood  surface.  Moreover,  while 
multilocus  genotypes  are  key  unobservables  in  genetic 
analysis,  it  may  not  always  be  efficient  to  consider 
these  the  latent  variables  in  a  Monte  Carlo  analysis; 
segregation  indicators  that  specify  the  passage  of  genes 
segregating  in  a  pedigree  are  more  fundamental  even 
than  genotypes,  and,  provided  the  relevant  probabilities 
of  observed  data  given  the  latent  variables  can  be  easily 
computed,  the  genotypes  of  individuals  can  be  bypassed. 

Autozygosity  patterns  at  multiple  linked  loci  become 
of  increasing  relevance  as  multilocus  linkage  analyses  are 
performed.  The  random  walk  framework  of  Donnelly 
(1983), and the Poisson clumping heuristic of Aldous


(1989)  together  make  study  of  the  prior  probability 
distribution  of  patterns  more  feasible  (Blossey,  1993). 
However,  in  order  to  assess  autozygosity  in  the  light  of 
data, or to use realisations from the posterior distribution
of autozygosity conditional on data in a likelihood
analysis, MCMC provides the most efficient computational
approach in many cases. Posterior probabilities
of  autozygosity  patterns  are  more  efficiently  estimated 
by  MCMC,  than  by  reweighting  prior  probabilities 
estimated  by  i.i.d  Monte  Carlo. 
Abstract

One of the great success stories of modern
molecular genetics has been the ability of biologists
to isolate and characterize the genes responsible
for serious inherited diseases like Huntington's
disease, cystic fibrosis, and myotonic dystrophy.
Instrumental in these efforts has been
the construction of so-called “physical maps” of
large regions of human chromosomes.

Constructing a physical map of a chromosome
presents a number of interesting challenges
to the computational statistician. In addition to
the general ill-posedness of the problem, complications
include the size of the data sets, computational
complexity, and the pervasiveness of
experimental error. The nature of the problem
and the presence of many levels of experimental
uncertainty make statistical approaches to map
construction appealing. Simultaneously, however,
the size and combinatorial complexity of
the problem make such approaches computationally
demanding.

In this paper we discuss what physical maps
are and describe three different kinds of physical
maps, outlining issues which arise in constructing
them. In addition, we describe our experience
with powerful, interactive statistical computing
environments. We found that the ability
to create high-level specifications of proposed algorithms
which could then be directly executed
provided a flexible rapid prototyping facility for
developing new statistical models and methods.
The ability to check the implementation of an
algorithm by comparing its results to those of an
executable specification enabled us to rapidly debug
both specification and implementation in an
environment of changing needs.


1  Overview


One major goal of the Human Genome Project (Olson
1993) is to reduce the time and expense required to isolate
and study regions of biological interest by constructing
physical maps of the entire human genome. Such
maps can then be used by other molecular biologists involved
in the interesting and difficult task of understanding
how the approximately 100,000 genes buried in our
chromosomes conspire to make us human beings.

In  this  article  we  will  concentrate  on  issues  involved 
in  constructing  physical  maps.  First,  we  will  describe 
what  physical  maps  are.  Then  we  will  discuss  some  of  the 
statistical  and  computational  problems  associated  with 
constructing  various  kinds  of  physical  maps,  specifically 

•  STS  content  maps, 

•  maps  based  on  random  fingerprinting,  and 

•  restriction  maps. 

Finally, we will describe how we used a modern statistical
computing environment to help us with the tricky
task of ensuring that the programs we implemented were
faithful to the ideas and algorithms we designed.

As an aside, the reader should be aware that biology
is one of those sciences where exceptions and special
cases abound: nearly every general statement one can
make turns out to be wrong. Physical mapping is certainly
no exception to this situation. In the interests
of clarity and brevity, however, we will confine our attention
only to typical examples and refrain from the
impulse to be general or encyclopedic. For a more thorough
introduction, see Nelson and Speed (1994).


1  Research  was  performed  under  the  auspices  of  the  U.  S.  Department  of  Energy  by  Lawrence  Livermore  National  Laboratory 
contract  number  W-7405-ENG-48,  with  additional  support  from  NSF  grant  DMS-91-13527. 


2  Research was partially supported by NSF grant DMS-91-13527.
2  What is a “physical map?”

What are physical maps? The answer is not as precise as
one would like. To understand this, we must first understand
something about recombinant DNA techniques as
well as current limitations in how regions of DNA can be
analyzed by molecular geneticists. (See Brown (1990)
for a readable introduction to recombinant DNA techniques
and genetic analysis.) A fundamental problem in
molecular genetics is that

•  current methods of chemically analyzing substantial
stretches of DNA require a sample containing a
large number of identical molecules, typically produced
by recombinant DNA amplification; however

•  the maximum size of a region that can be amplified
by current techniques is orders of magnitude
smaller than even the smallest human chromosome.

For  example,  the  size  of  the  longest  contiguous  fragment 
of  DNA  that  can  be  reliably  amplified  by  a  recombinant 
DNA process called “cloning” ranges from around 4 × 10^4
to 1 × 10^6 bases, depending on the vector and host. Similarly,
the longest stretch of DNA that can be reliably amplified
by a purely chemical technique known as polymerase
chain  reaction  (pcr)  is  approximately  1  x  10^  bases.  In 
contrast, the twenty-two human autosomes range in size
from around 3 × 10^8 bases for chromosome 1 down to
about 5 × 10^7 bases for chromosome 21. Because of
this  mismatch  in  sizes,  producing  enough  DNA  to  permit 
biochemical  analyses  currently  requires  a  process  called 
cloning, in which

•  a large number of identical chromosomes are broken
randomly into fragments by one or more of a
class of enzymes known historically as restriction
enzymes,

•  individual  fragments  of  appropriate  size  are  incor¬ 
porated  by  biological  or  chemical  mechanisms  into 
the  DNA  of  host  organisms  such  as  E.  coli  or  yeast, 

•  the individual hosts are separated from each other
and allowed to grow into colonies, with the fragment
in each host being replicated along with the
DNA of the host during cell division (mitosis).

In  this  way,  the  natural  DNA  replication  machinery  of 
the  host  organism  is  exploited  to  replicate  the  fragment 
along with the host's chromosomes. After enough mitoses,
each host colony can be harvested. The result of
this  process  is  a  library  of  cloned  chromosome  fragments, 
where  each  fragment  is  present  in  large  enough  quantities 
to  permit  isolation  and  purification  of  the  fragment  and 
subsequent biochemical analyses. Unfortunately, the library
contains no information about the relative positions
of the fragments along the chromosome. Physical
maps are data structures which provide the necessary information
to enable the order and distance among fragments
to be deduced. Hence, they are essential if a collection
of overlapping cloned chromosome fragments (a
contig) is to be treated as though it were a contiguous
region of DNA.

Outside of the genetics community, the process of
physical mapping is much less well known than the
process of genetic mapping, as described by Elizabeth
Thompson (this proceedings). Table 1 on the following
page attempts to clarify the situation by contrasting several
attributes of the two types of maps. In both cases,
one is attempting to detect relationships and compute
“distances” between genetic objects of interest. In genetic
mapping, one uses data from pedigrees and phenotypes
to estimate the expected number of recombinations
between two loci of interest.

In physical mapping, on the other hand, one uses
data from experiments which we call “fingerprints” to
determine order and distance between clones or more abstract
objects called sequence-tagged sites (STSs), which
we will define presently. In this context, a “fingerprint”
for a clone consists of data from one or more experiments
on that clone, the results of which depend in some way on
the underlying DNA sequence. Hence, the results of these
experiments can help identify or characterize the clone.
Cloned fragments which overlap, i.e., share a portion of
the genome, may produce fingerprints more similar to
one another than clones which do not overlap.
            Genetic Mapping                           Physical Mapping
Objects     Genes or loci                             Clones or STSs
Distance    Expected number of recombinations         Base pairs
Data        Pedigrees and phenotypes                  “Fingerprints”
Goal        Order and distance among genes or loci    Order and distance among clones or STSs
Why?        Localize gene to small region of genome   Prepare for biochemistry: sequencing, probing ...

Table 1: A comparison of genetic and physical mapping.

Ch 19 Marker Pairs     Physical     Genetic Distances
                       Distance     Female      Male
D19S20 - D19S247        1.5 Mb       9.4 cM    30.5 cM
D19S177 - D19S76        2.0 Mb       6.1 cM     4.6 cM
D19S76 - D19S179       12.0 Mb      19.1 cM    10.7 cM

Table 2: A comparison of genetic and physical distances.


Since  genetic  maps  provide  information  on  distances 
between  loci,  and  loci  can  often  be  associated  with 
clones,  one  might  wonder  why  geneticists  don’t  just  use 
the  genetic  map  to  determine  distances  between  clones. 
Table 2 shows why. It describes physical and genetic
distances between four polymorphic markers on chromosome 19.
These four markers span a region from a
point near the end of the short arm of the chromosome
(D19S20) down to a point near the center of the chromosome
(D19S179).

One immediately sees from this table that physical
distance is only loosely correlated with genetic distance.
What is more, genetic distances are sex-specific. Typically,
many more recombinations occur in sperm than
in eggs. However, as can be seen from the three pairs
in Table 2, this is not always the case. Consequently,
although genetic distances are used as rough guides to
physical distances (the rule-of-thumb is 10^6 bases per
centimorgan), this correspondence is rough indeed, and
physical  maps  must  be  constructed  to  determine  the  pre¬ 
cise  physical  relationships  among  genetic  objects. 
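
How loose the correspondence is can be seen by dividing the physical distances in Table 2 by the genetic distances. The short Python sketch below does only this arithmetic on the tabulated values; the rows are identified by position rather than by marker name, and nothing beyond the table is assumed.

```python
# Rows of Table 2: (physical distance in Mb, female cM, male cM).
table2 = [
    (1.5, 9.4, 30.5),
    (2.0, 6.1, 4.6),
    (12.0, 19.1, 10.7),
]


def mb_per_cm(physical_mb, genetic_cm):
    """Megabases of DNA per centimorgan of genetic distance."""
    return physical_mb / genetic_cm


# One (female, male) pair of Mb-per-cM ratios for each marker pair.
ratios = [(mb_per_cm(mb, f), mb_per_cm(mb, m)) for mb, f, m in table2]

for row, (female, male) in enumerate(ratios, start=1):
    print(f"pair {row}: female {female:.3f} Mb/cM, male {male:.3f} Mb/cM")

all_values = [value for pair in ratios for value in pair]
print(f"largest / smallest ratio: {max(all_values) / min(all_values):.1f}")
```

The ratios differ by more than an order of magnitude across the three intervals, which is why a single conversion factor cannot replace a physical map.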

3  Constructing  Physical  Maps 

Now, let us turn to the process of constructing physical
maps  to  see  what  roles  computers  and  statistics  play.  In 
this  section,  we  will  describe  how  three  different  kinds  of 
maps  are  constructed. 


3.1  STS  Content  Maps 

Most of the large, low resolution physical maps now published
are STS content maps (Green and Green 1991). A
sequence-tagged  site,  or  STS,  is 

•  a  unique  sequence  in  the  genome,  along  with 

•  a  reliable  biochemical  assay  for  determining 
whether  or  not  any  given  segment  of  DNA  contains 
that  sequence. 

Hence,  one  can  determine  with  low  probability  of  error 
whether  or  not  a  clone  contains  any  given  STS.  In  this 
case,  the  “fingerprint”  for  a  clone  is  the  collection  of  STSs 
it  contains. 

Figure  1  contains  a  diagram  of  a  toy  example  which 
we  will  use  to  describe  issues  in  STS  content  mapping. 
Each  horizontal  line  represents  a  clone.  In  the  diagram, 
we  show  five  clones,  labeled  1  through  5.  The  five  clones 
overlap  in  the  way  indicated,  although  we  don’t  know 
that,  of  course.  Each  vertical  arrow  represents  an  STS. 
In  the  diagram,  we  show  five  STSs,  labeled  a  through 
e.  The  task  is  to  use  information  about  which  clones 
contain  which  STSs  to  determine  the  correct  order  of  the 
STSs.  If  we  can  determine  without  error  which  clones 
contain  which  STSs,  the  following  algorithm  will  produce 
a  correct  ordering. 
Figure 1: A simple example with five clones and five STSs.


First, construct an incidence matrix A containing one
row per clone and one column per STS probe. Let A_ij = 1
whenever clone i contains probe j, and 0 otherwise. For
our example in Figure 1, the matrix would be


             STSs
             a  b  c  d  e
          1  1  1  0  0  1
          2  0  1  0  0  0
Clones    3  1  0  0  1  1
          4  0  0  0  1  0
          5  0  0  1  0  0

Each ordering of the probes corresponds to a permutation
of the columns of A, which we can represent by
AP, where P is a permutation matrix. As the clones in
Figure 1 are intervals, and we have perfect detection, it
is immediately clear that correct orderings P must permute
the columns of A so that all the ones in each row of
AP appear consecutively. Conversely, any permutation
P for which AP has all ones in each row appearing consecutively
corresponds to a correct probe ordering. Thus
the problem reduces to finding all permutation matrices
P for which AP has all the ones in each row appearing
consecutively.

An incidence matrix whose columns can be permuted
so that all the ones in its rows appear consecutively is
said to have the consecutive ones property for rows. Fortunately,
it is easy to check if a matrix has the consecutive
ones property for rows. Booth and Lueker (1976)
describe  linear  time  algorithms  which  perform  the  check 
and  return  all  correct  permutations  in  a  data  structure 
called  a  “PQ  tree”.  Hence,  if  the  data  are  perfect,  the 
problem  is  solvable  in  linear  time. 
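
For a toy problem the consecutive ones property can be checked by brute force. The Python sketch below enumerates every column ordering of the incidence matrix for Figure 1 and keeps those in which the ones of each row are contiguous. It is an illustration only: it is exponential in the number of probes and is not the linear-time PQ-tree algorithm of Booth and Lueker. The function names are hypothetical.

```python
from itertools import permutations

# Incidence matrix for Figure 1: rows are clones 1-5, columns are STSs a-e.
STS = ["a", "b", "c", "d", "e"]
A = [
    [1, 1, 0, 0, 1],
    [0, 1, 0, 0, 0],
    [1, 0, 0, 1, 1],
    [0, 0, 0, 1, 0],
    [0, 0, 1, 0, 0],
]


def ones_are_consecutive(row):
    """True if all the ones in the row form a single contiguous block."""
    positions = [k for k, x in enumerate(row) if x == 1]
    return not positions or positions[-1] - positions[0] + 1 == len(positions)


def consistent_orderings(matrix, labels):
    """Return every column ordering under which each row has consecutive ones."""
    found = []
    for perm in permutations(range(len(labels))):
        if all(ones_are_consecutive([row[j] for j in perm]) for row in matrix):
            found.append(tuple(labels[j] for j in perm))
    return found


orderings = consistent_orderings(A, STS)
for order in orderings:
    print(" ".join(order))
```

The enumeration returns each admissible ordering together with its mirror image, since the data cannot distinguish the two orientations of the chromosome.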

In our example, the two permutations (b, a, e, d, c)
and (b, e, a, d, c) both produce the identical permuted
matrix:

             STSs
             b  a  e  d  c
          1  1  1  1  0  0
          2  1  0  0  0  0
Clones    3  0  1  1  1  0
          4  0  0  0  1  0
          5  0  0  0  0  1


and  hence  both  are  orderings  consistent  with  the  data 
A.  Also  note  that  the  locations  of  the  runs  of  ones  in 
the  rows  of  AP  provide  an  indication  of  precisely  what 
spatial  relationships  among  the  clones  can  be  deduced 
from  the  data. 


Unfortunately, the data are never perfect. False negative
rates of up to ten percent are not unusual (S. Lewis,
private communication). Even more unfortunately, the
existence of errors renders the problem much more difficult.
The consecutive ones property is lost, and the
problem of finding some nearby matrix A' which does
have the consecutive ones property is, in general, NP-hard.

Current  approaches  to  handling  data  with  errors 
treat  the  problem  as  one  of  combinatorial  optimization. 
In  general,  combinatorial  optimization  problems  involve 
searching  over  some  large,  but  finite  space  in  an  attempt 
to  minimize  some  objective  function  defined  on  elements 
of  that  space.  Issues  to  be  resolved  include  the  structure 
of  the  space,  the  nature  of  the  objective  function,  and 
the strategy used to search the space. In the case of STS
content mapping, the search space is the space of all permutations
on n letters, where n is the number of probes.
The objective function is usually something like the total
number of runs of ones, or perhaps minus a pseudo log-likelihood
of the data given the underlying probe order.
At  any  given  step  in  the  search,  the  next  permutation 
to  be  tested  is  determined  heuristically,  and  simulated 
annealing  is  often  used  to  escape  from  local  minima. 
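
The search just described can be sketched in a few lines of Python. The example below minimizes the total number of runs of ones over probe orderings by simulated annealing, using the toy matrix of Figure 1. The move set (swapping two probes), the geometric cooling schedule, and all parameter values are assumptions chosen for illustration; they are not the settings used for any of the maps discussed here.

```python
import math
import random

# Incidence matrix for Figure 1: rows are clones, columns are STSs a-e.
A = [
    [1, 1, 0, 0, 1],
    [0, 1, 0, 0, 0],
    [1, 0, 0, 1, 1],
    [0, 0, 0, 1, 0],
    [0, 0, 1, 0, 0],
]


def count_runs(matrix, order):
    """Objective: total number of runs of ones over all rows, columns in `order`."""
    total = 0
    for row in matrix:
        previous = 0
        for j in order:
            if row[j] == 1 and previous == 0:
                total += 1
            previous = row[j]
    return total


def anneal(matrix, n_steps=2000, start_temp=1.0, cooling=0.995, seed=0):
    """Simulated annealing over column orderings; returns the best order seen."""
    rng = random.Random(seed)
    order = list(range(len(matrix[0])))
    cost = count_runs(matrix, order)
    best_order, best_cost = order[:], cost
    temp = start_temp
    for _ in range(n_steps):
        i, j = rng.sample(range(len(order)), 2)      # propose swapping two probes
        candidate = order[:]
        candidate[i], candidate[j] = candidate[j], candidate[i]
        delta = count_runs(matrix, candidate) - cost
        # Always accept improvements; accept uphill moves with Metropolis probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            order, cost = candidate, cost + delta
            if cost < best_cost:
                best_order, best_cost = order[:], cost
        temp *= cooling
    return best_order, best_cost


start_cost = count_runs(A, [0, 1, 2, 3, 4])
best_order, best_cost = anneal(A)
print("runs in original order:", start_cost)
print("best order found:", best_order, "with", best_cost, "runs")
```

With error-free data the smallest possible objective value equals the number of non-empty rows, so a result at that value corresponds to an ordering with the consecutive ones property.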

LLNL has taken this approach in its attempts to
produce an integrated map of chromosome 19. The data
for LLNL's integrated map consist of over 2800 probes


D.  O.  Nelson  and  T.  Speed  511 


on 725 different segments of DNA. An even larger example
is provided by Généthon's 1st Generation STS map
(Cohen  et  al.  1993),  which  contains  information  about 
2100  STSs  and  6580  clones.  Such  large  map  construction 
efforts  take  many  hours  to  compute,  even  when  run  on 
large workstations. Some recent work by Karp and his
colleagues (R. Karp, private communication) indicates
that solutions might be computed more quickly
by treating the problem as a Hamming distance
Traveling Salesman Problem (TSP), and exploiting the
wealth of heuristics developed to solve TSP problems.

3.2  Maps Based on “Random” Fingerprinting

STS  content  maps  are  not  the  only  kind  of  physical  map 
currently  being  constructed.  One  can  also  build  maps  of 
clone  libraries  “bottom-up”  by  a  two  stage  process: 

•  use a fingerprint-based similarity measure to measure
the similarity of any pair of clones in the library,
and then

•  use this similarity measure in a clustering procedure
(Mardia, Kent, and Bibby 1979) to construct
contigs.

The type of similarity measure used depends in large
part on the nature of the fingerprint data. LLNL relied
on a probability-based fingerprint when it used this approach
as its first step in constructing a map of chromosome 19.
In this situation, we obtain a random “match”
vector D_ij for each pair of clones i and j. In addition, we
have a simplified statistical model which enables us to
compute Pr(D_ij | t), where t ∈ [0, 1] is the proportion of
DNA shared by the two clones. Using this model, we can
compute the posterior odds of overlap, given the data,
up to a constant:

\[
\frac{\Pr(\text{overlap} \mid D_{ij})}{\Pr(\text{no overlap} \mid D_{ij})}
\propto
\frac{\Pr(D_{ij} \mid \text{overlap})}{\Pr(D_{ij} \mid \text{no overlap})}
= L(i,j),
\]

where

\[
L(i,j) = \frac{\int_{0 < t \le 1} \Pr(D_{ij} \mid t)\, dP(t)}{\Pr(D_{ij} \mid t = 0)}. \tag{1}
\]

We then use log L(i,j) as our similarity measure in a
“smarter-than-average” single-linkage clustering procedure
(T. Slezak, personal communication).

Now, computing L(i,j) can be quite laborious. In
our case, we have over 10^4 clones to assemble into contigs.
Hence, we need to compute over 5 × 10^7 different
L(i,j) values to assemble a map, where each L(i,j) is a
numerical integration. Currently this process, even with
several heuristic screening procedures to screen out obviously
non-overlapping clones, runs several days on a
network  of  over  30  workstations. 
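
To make the cost of one such evaluation concrete, here is a minimal Python sketch of an integrated likelihood ratio of this general form. The statistical model in it is entirely hypothetical and is not the model used at LLNL: the data are reduced to k matching bands out of m, each band is assumed to match independently with probability q0 + (1 - q0) t, the prior P(t) on (0, 1] is taken to be uniform, and the integral is approximated by the trapezoid rule.

```python
import math


def match_prob(t, q0=0.1):
    """Hypothetical probability that a single band matches when the clones
    share a proportion t of their DNA; q0 is the chance-match rate at t = 0."""
    return q0 + (1.0 - q0) * t


def likelihood(k, m, t):
    """Pr(k of m bands match | t) under the toy binomial model."""
    q = match_prob(t)
    return math.comb(m, k) * q**k * (1.0 - q) ** (m - k)


def integrated_lr(k, m, n_grid=1000):
    """Trapezoid-rule approximation to the integral of Pr(D | t) over t,
    with a uniform prior on t, divided by Pr(D | t = 0)."""
    values = [likelihood(k, m, i / n_grid) for i in range(n_grid + 1)]
    numerator = (sum(values) - 0.5 * (values[0] + values[-1])) / n_grid
    return numerator / likelihood(k, m, 0.0)


for k in (2, 10):
    print(f"{k} of 20 bands match: log L = {math.log(integrated_lr(k, 20)):.2f}")

n_clones = 10_000
print("pairs of clones to score:", n_clones * (n_clones - 1) // 2)
```

Even this toy version needs about a thousand likelihood evaluations per pair, which shows why screening out obviously non-overlapping pairs before integrating is worthwhile.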

3.3  Restriction Maps for Validating Contigs

Once we have a putative map for a set of clones, we
then need to validate the overlap configuration among
the clones. We do so by constructing a restriction map
of the clones. Constructing these maps rapidly currently
poses a large, and as yet unsolved, computational challenge.

Figure  2  shows  an  example  of  a  restriction  map  of 
a  large  contig  containing  twenty-eight  clones,  labeled 
Fi7252  through  D716.  (The  labels  are  meaningful  to 
the  biologist,  but  are  irrelevant  to  this  discussion.)  The 
DNA in each clone is represented by a horizontal line proportional
to its length in bases. Each tick mark on each
line represents a restriction site: a specific sequence (in
this case GAATTC) that will be recognized by a particular restriction
enzyme. Under the right conditions, restriction
enzyme molecules will bind to DNA molecules at restriction
sites and cut the DNA into fragments whose sizes can
be measured. For instance, the five tick marks on clone
F6320 indicate that that clone contains five restriction
sites, and that when digested, it will produce six fragments
whose relative sizes are indicated by the distances
between the tick marks. The line of tick marks at the
bottom of the figure indicates the positions of, and distances
between, all of the sites in the stretch of DNA spanned by
the contig.

One begins to construct a restriction map by digesting
each clone and measuring the lengths of the resulting
fragments. Then, given the list of clones and observed
fragment sizes for each clone, one attempts to lay out the
clones and line up all the fragments to produce a map
like that in Figure 2.
Figure 2: A restriction map of a large contig.


Currently, these maps are all produced manually, by
an experienced mapper using a spreadsheet. There are
no automatic programs to produce these maps from a collection
of clones and measured fragment sizes. A number
of issues complicate the construction of these maps.
First, the problem is combinatorially explosive. Figure 3
shows a graph of the number of possible consistent, topologically
distinct arrangements of clone beginnings and
endings, as a function of the number of clones in a contig
(Newberg 1993). Note the log scale.


Second, the measurement of fragment sizes is approximate
and incomplete. The measurements are approximate
in that, under good conditions, fragment
lengths can be measured to within about one-half percent
(Lamerdin and Carrano 1993). In addition, the
measurements are incomplete in that it is sometimes
difficult to determine exactly how many fragments of a
given size have been digested. Also, there is left censoring:
very small fragments are sometimes not measured
at all.
Figure 3: Number of distinct interleavings versus contig size.


Third, perusal of the map will reveal that the sizes
of the first and last fragments in a clone do not match
the sizes of interior fragments of other clones. This is
because the ends of the clones do not correspond to restriction
sites for the enzyme used to create the restriction
map. The sizes of the end fragments simply must
not exceed the sizes of the interior fragments with which
they are matched. Of course, the map constructor does
not know beforehand which fragments are the end fragments.

Potential programs have another barrier to overcome:
experienced mappers can assemble an “average” map in
about an hour, based on good information about the approximate
order of the clones. This apparent ability to
recognize patterns in fragment sizes makes expert human
mappers tough competitors to any program.

4  Getting  It  Right 

The analysis and algorithms which go into a map assembly
program can be quite complex. For instance, the
integrand in Equation 1 involves several terms which incorporate
assumptions about the data generation and
error contamination processes, and must be numerically
integrated to provide the numerator of the integrated
likelihood ratio. To add to the computational burden,
Equation 1 must be evaluated over 5 × 10^7 times during
the construction of a map. To make the overall map construction
process feasible, this computation must be fully
optimized. The problem we faced was how to ensure
that the analysis we performed and algorithms we designed
were fully specified and faithfully implemented.

Our solution to this problem, arrived at only after
other, more ad hoc methods failed, was to describe what
we wanted to do in the very high-level language implemented
by S-PLUS (Statistical Sciences, Inc. 1991). This
specification could then be tested and debugged by executing
it against sample data. After the specification
was debugged, it was then reimplemented in C for speed
of execution. After the C code was tested and debugged,
the answers it produced for sample data could then be
compared  with  the  answers  produced  by  the  specifica¬ 
tion.  Any  differences  represented  bugs  in  the  specifica¬ 
tion,  implementation,  or  both. 
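
The workflow can be illustrated on a very small scale. In the Python sketch below, both roles are played by Python functions purely for convenience: `runs_spec` stands in for the high-level executable specification and `runs_fast` for the optimized reimplementation. The function names, the choice of computation (counting runs of ones in a row of an incidence matrix), and the random test data are all assumptions made for illustration.

```python
import random


def runs_spec(row):
    """Executable specification: count maximal blocks of consecutive ones,
    written for clarity rather than speed."""
    text = "".join(str(x) for x in row)
    return len([block for block in text.split("0") if block])


def runs_fast(row):
    """Stand-in for the optimized implementation: count 0 -> 1 transitions
    in a single pass over the row."""
    count, previous = 0, 0
    for x in row:
        if x and not previous:
            count += 1
        previous = x
    return count


def compare(n_cases=500, seed=1):
    """Run both versions on random sample data and collect any disagreements."""
    rng = random.Random(seed)
    mismatches = []
    for _ in range(n_cases):
        row = [rng.randint(0, 1) for _ in range(rng.randint(0, 12))]
        if runs_spec(row) != runs_fast(row):
            mismatches.append(row)
    return mismatches


mismatches = compare()
print("disagreements between specification and implementation:", len(mismatches))
```

Any row that lands in `mismatches` is a concrete input on which to investigate whether the specification, the implementation, or both are at fault.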

This simple method of operational specification
proved invaluable to us in a number of ways. First,
it highlighted communication problems and definitional
ambiguities between designer and implementor. Problems
with defining exactly what a “match” meant, and
how it was to be implemented, were quickly spotted and
nailed down. Second, we found that as often as not, it
was the specification that was ambiguous, indicating a
need for further thought on the part of the designers.
Third, having an executable specification could guide
the debugging process, providing answers to partially
complete calculations. Finally, the iteration process between
designer and implementor converged quite rapidly,
producing complex working software much more quickly
than had been possible in the past. The technique was
so successful that we now use it on all our software that
has a significant mathematical or statistical component.

5  Summary 

In this paper we have described several kinds of physical
maps and outlined the methods currently employed for
constructing these maps. These methods are characterized
by being computationally intensive, combinatorially
complex, and sometimes containing a considerable statistical
component. It is clear from the descriptions that
many computational and statistical challenges remain to
be overcome.

One important challenge that must be addressed is
how to parallelize the computational burden. For some
tasks, this is easy: each of the 5 × 10^7 values of Equation 1
is an independent computation. Given a shared database
and a way to communicate tasks to various workstations,
parallelization becomes a matter of dispatching computations
to free workstations and receiving the results.


The  Human  Genome  Center  at  LLNL  has  implemented 
such  a  scheme  for  its  network  of  over  thirty  worksta¬ 
tions.  However,  for  many  tasks,  such  as  combinatorial 
optimization  using  simulated  annealing,  it  is  not  clear 
how  to  parallelize  the  computation. 

Another  unsolved  issue  is  how  to  combine  informa¬ 
tion  from  various  sources.  The  map  of  chromosome  19  in¬ 
tegrates  information  on  well  over  a  dozen  different  types 
of  probes  and  DNA  regions,  each  with  its  own  size,  probe 
resolution,  and  error  characteristics.  At  the  present 
time,  most  of  this  data  is  treated  democratically,  ignor¬ 
ing  the  special  features  of  each  data  type. 

Finally,  current  procedures  for  constructing  maps 
provide  no  information  about  the  reliability  of  the  re¬ 
sulting  map.  Developing  statistics-based  methods  for 
map  construction  could  provide  a  first  step  towards  as¬ 
sessing  the  uncertainty  of  the  resulting  map  as  well  as 
the  sensitivity  of  the  map  to  features  in  the  underlying 
data. 


References 

Booth, K. S. and G. S. Lueker (1976). Testing for the consecutive ones property, interval graphs, and graph planarity
using PQ-tree algorithms. Journal of Computer and System Sciences 13, 335-379.

Brown, T. A. (1990). Gene cloning: an introduction (2nd ed.). Chapman and Hall.

Cohen, D., A. Chumakov, and J. Weissenbach (1993). A first-generation physical map of the human genome. Nature
366, 698-701.

Green, E. D. and P. Green (1991). Sequence-tagged site (STS) content mapping of human chromosomes: theoretical
considerations and early experiences. PCR Methods and Applications 1, 77-90.

Lamerdin, J. E. and A. V. Carrano (1993). Automated fluorescence-based restriction fragment analysis.
BioTechniques 15, 294-300.

Mardia, K. V., J. T. Kent, and J. M. Bibby (1979). Multivariate Analysis. New York: Academic Press.

Nelson, D. O. and T. P. Speed (1994). Statistical issues in constructing high resolution physical maps. Statistical
Science, in press.

Newberg, L. A. (1993). Finding, evaluating, and counting DNA physical maps. Ph.D. thesis, University of California,
Berkeley.

Olson, M. V. (1993). The human genome project. Proc. Natl. Acad. Sci. USA 90, 4338-4344.

Statistical Sciences, Inc. (1991). S-Plus User's Manual




A  SIMULATION  STUDY  TO  EVALUATE  THE 
PERFORMANCE  OF  A  NEW  VARIABLE  SELECTION 
METHOD  IN  REGRESSION 

by 

Gonzalo R. Mendieta†, Shahar Boneh‡, Roxy Walsh‡

†Universidad San Francisco de Quito, Quito, Ecuador
gonzalo@mail.usfq.edu.ec

‡Wichita State University, Wichita, KS 67260-0033
boneh@twsuvm.uc.twsu.edu
walsh@twsuvm.uc.twsu.edu


ABSTRACT: The performance of a new stepwise method
for variable selection in regression will be evaluated in a
large scale simulation study. This method is an extension of
principal components regression. The on-going simulation
is described. Preliminary results and previous tests on well
known data sets show that the method is quite promising
and may work better than other methods in certain
situations.

1.  INTRODUCTION 

Variable  selection  in  regression  is  necessary  when  data 
are  collected  on  a  large  number  of  variables,  often 
correlated,  while  the  goal  is  to  obtain  a  model  with  only  a 
few predictor variables. Many variable selection
methods are commonly used; these include forward, backward
and  stepwise  methods,  or  exhaustive  search  methods  (using 
various  criteria).  While  these  methods  often  yield  good 
outcomes,  they  have  their  shortcomings.  Exhaustive  search 
procedures  may  be  very  costly  or  even  unfeasible  in  large 
scale  problems,  while  systematic  algorithms  may 
sometimes  fail  to  detect  the  best  predictive  subset  of 
variables.  For  a  comprehensive  survey  of  variable  selection 
methods,  we  refer  to  Miller  [10]. 

Principal  component  regression,  a  well  known  and 
effective  technique  for  reducing  the  dimensionality  of  the 
space  of  predictors,  has  the  shortcoming  that  there  is  no 
corresponding  reduction  in  the  number  of  original 
variables.  Jeffers  [4]  was  the  first  to  show  that  principal 
component  analysis  can  be  utilized  to  reduce  the  number  of 
original  variables.  Realizing  that  the  principal  component 
transformation  may  be  more  informative  than  previously 
thought,  more  efforts  were  made  in  this  direction  in  the 
subsequent  years,  most  notably  by  Jolliffe  ([5],  [6],  [7]), 
Hawkins  [2],  and  Mansfield,  Webster  &  Gunst  [9]. 

Recently,  a  new  method  to  select  predictor  variables 
based  on  principal  components  was  proposed  by  Boneh  & 
Mendieta  [1].  The  method  is  stepwise  in  nature,  and  it  is 
based  on  repeated  selections  of  principal  components  and 
inversions  to  the  original  variables.  The  main  idea  of  this 
method  is  to  combine  the  advantages  of  stepwise  selection 


with  those  of  principal  component  regressions.  The  method 
was  tested  on  several  benchmark  data  sets,  and  produced 
good  results.  (An  example  is  given  in  Boneh  &  Mendieta 
[1]). Having established that the new method is statistically
sound, further study of its performance was called for,
in particular to identify its strengths and possible
weaknesses, and to determine in what circumstances it may
be preferable to other methods.

The  goal  of  this  paper  is  therefore  to  give  a  brief 
introduction  to  the  method  and  to  report  on  the  design  of  an 
on-going simulation study aimed at answering the above
questions.  In  Section  2  we  briefly  describe  the  selection 
method,  and  in  Section  3  we  describe  the  layout  and  goals 
of  the  simulation.  Some  general  remarks  are  given  in 
Section  4. 

2.  THE  SELECTION  METHOD 

We consider the standard linear regression model $Y = X\beta + \varepsilon$,
where $Y$ is an $n \times 1$ vector of responses, $X = [X_1, \ldots, X_p]$
is an $n \times p$ full rank matrix of predictor variables, $\beta$ is a
$p \times 1$ vector of unknown parameters, and $\varepsilon$ is an $n \times 1$
vector of uncorrelated and normally distributed random errors with mean 0 and
common variance $\sigma^2$. Without loss of generality, all the variables are
assumed to be standardized (with mean 0 and variance 1).
Thus $[X^T X;\ X^T Y]$ is the sample correlation matrix.

We  assume  that  the  reader  is  familiar  with  the  basic 
concepts  of  principal  component  analysis.  Otherwise,  as 
good  references  on  the  subject  we  recommend  Jolliffe  [8]  or 
Jackson  [3]. 

Prior to starting the selection, we choose a significance level $\alpha$
that is kept fixed throughout the process.

Step 0: Selection of the first variable
0.1. Obtain the principal components, $W = [W_1, \ldots, W_p]$,
of $[X_1, \ldots, X_p]$.

0.2. Fit the model $Y = W\gamma + \varepsilon$, and let $W_{(s)}$ be the subset
of $W$ containing the principal components for which
the regression coefficient $\gamma_j$ is significant at level $\alpha$.

0.3. If $W_{(s)}$ is empty, the selection process terminates with
the conclusion that no predictor variables should be
included in the model. Otherwise, let $SSE_j$, $j = 1, \ldots, p$,
denote the error sum of squares when $X_j$ is regressed
on $W_{(s)}$. The first predictor selected is the one for
which $SSE_j$ is minimal.

Continue to select additional variables according
to the following general steps:

Step I: Let $X_{(s)}$ and $X_{(r)}$ be respectively the sets of the
previously selected variables and the remaining unselected
variables. Regress each variable in $X_{(r)}$ on all the variables
in $X_{(s)}$ and obtain the corresponding vectors of
standardized residuals $\{E_j,\ j \in (r)\}$.

Step II: Obtain the principal components $W$ of $\{E_j\}$,
and regress $Y$ on $X_{(s)}$ and $W$.

Step III: Let $W_{(s)}$ denote the subset of $W$ containing
the principal components with significant regression
coefficients (at level $\alpha$).

Step IV: If $W_{(s)}$ is empty, the selection process
terminates. Otherwise, let $SSE_j$, $j \in (r)$, denote the error
sum of squares when $E_j$ is regressed on $W_{(s)}$. The next
variable selected is the one corresponding to the minimal
$SSE_j$.

After the selection of each variable, the previously
selected variables are verified (essentially reversing the
selection steps) as follows:

Step V: Let $X_k$ denote the most recently selected variable,
i.e., the one which was selected in the current step, and let
$X_{(c)}$ denote the set of the previously selected variables.
Regress each of the variables in $X_{(c)}$ on $X_k$ and obtain the
standardized residuals $E_{(c)}$.

Step VI: Obtain the principal components $W_{(c)}$ of $E_{(c)}$, and
regress $Y$ on $X_k$ and $W_{(c)}$.

Step VII: If all the regression coefficients of $W_{(c)}$ are
significant at level $\alpha$, we conclude that all the variables in
$X_{(c)}$ should stay in the model.

Step VIII: Otherwise, one variable from $X_{(c)}$ must be
dropped. To determine which one, let $W_{(n)}$ be the subset of
$W_{(c)}$ containing the principal components with the
non-significant coefficients. Regress each residual vector in $E_{(c)}$
on $W_{(n)}$, and obtain $SSE_j$, $j \in (c)$. The variable in $X_{(c)}$
corresponding to the minimal $SSE_j$ is dropped from the
model.

The verification is then carried out again to check if
additional variables in $X_{(c)}$ should be dropped. A predictor
variable that was dropped is excluded from the pool of
potential variables in all the future steps to avoid possible
cycling in the process.

The process terminates when no principal components
have significant regression coefficients in the selection step,
or when the pool of predictor variables is depleted.

The method is described in detail in Boneh &
Mendieta [1]. The following are the main formulas used in
the implementation of the above steps. Proofs are given in
[1].

(1) To select principal components in step 0.2 and the
general steps III & VII, the hypothesis $H_0: \gamma_j = 0$ is tested
by the t-test as follows:

Reject $H_0$ if $|t_j| > t_{n-p-1,\,\alpha/2}$, where $t_j$ is the t statistic of $\hat{\gamma}_j$,

$$\hat{\gamma}_j = \frac{1}{\lambda_j} V_j^T E^T Y$$

and

$$SSE = 1 - (Y^T E)\, V \Lambda^{-1} V^T (E^T Y) - (Y^T X_{(s)})(X_{(s)}^T X_{(s)})^{-1}(X_{(s)}^T Y).$$

Here $E$ denotes the matrix of
standardized residuals of the regression of the remaining
unselected variables on $X_{(s)}$. When selecting the first
variable, $E$ is replaced by $X$ and $X_{(s)}$ is empty. Note that
the matrices $V$ and $\Lambda$ are computed each time from a
different set of variables.

(2) $SSE_j$, $j \in (r)$ (Step IV), is calculated by

$$SSE_j = 1 - \sum_{k \in (s)} \lambda_k v_{jk}^2,$$

where $v_{jk}$ is the $j$-th element of the eigenvector $V_k$.

(3) Let $E_j$ be the vector of standardized residuals when
regressing $X_j$ ($j \in (r)$) on $X_{(s)}$ (Step I). Denote by $E =
\{E_j,\ j \in (r)\}$. All we need for the next selection is $E^T E$
and $E^T Y$, which are given by

$$E^T E = \left( \frac{A_{ij}}{\sqrt{A_{ii} A_{jj}}} \right),\ i, j = 2, \ldots, p, \qquad E^T Y = \left( \frac{B_j}{\sqrt{A_{jj}}} \right),\ j = 2, \ldots, p,$$

where

$$A_{ij} = (X_i^T X_j) - (X_i^T X_{(s)})(X_{(s)}^T X_{(s)})^{-1}(X_{(s)}^T X_j)$$

and

$$B_j = (Y^T X_j) - (Y^T X_{(s)})(X_{(s)}^T X_{(s)})^{-1}(X_{(s)}^T X_j).$$

An important feature that emerges from Formulas (1)-
(3) is that the method can be carried out with the
correlation matrix only, without direct use of the raw data.
This feature enhances the computational efficiency and
convenience of the algorithm.
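
As an illustration of working from the correlation matrix alone, the following minimal sketch computes the residual correlations of Formula (3) in the simplest case, where exactly one variable has been selected so that $X_{(s)}^T X_{(s)} = 1$. The function name, the restriction to a single selected variable, and the numerical values are assumptions made for the example; they are not taken from [1].

```python
import math

def residual_correlations(R, r_y, s):
    """Return (E'E, E'Y, remaining indices) after selecting the single variable s.

    R   : p x p correlation matrix of the predictors (list of lists)
    r_y : length-p list of correlations between each predictor and Y
    s   : index of the one variable selected so far (assumption: only one)
    """
    p = len(R)
    rest = [j for j in range(p) if j != s]
    # With X_(s) a single standardized variable, X_(s)'X_(s) = 1, so
    # A_ij = r_ij - r_is * r_sj  and  B_j = r_jy - r_sy * r_sj.
    A = {(i, j): R[i][j] - R[i][s] * R[s][j] for i in rest for j in rest}
    B = {j: r_y[j] - r_y[s] * R[s][j] for j in rest}
    # Rescale so that the residual vectors are standardized.
    ete = [[A[i, j] / math.sqrt(A[i, i] * A[j, j]) for j in rest] for i in rest]
    ety = [B[j] / math.sqrt(A[j, j]) for j in rest]
    return ete, ety, rest

# Hypothetical 3-predictor example with the first variable already selected.
R = [[1.0, 0.5, 0.3], [0.5, 1.0, 0.4], [0.3, 0.4, 1.0]]
r_y = [0.6, 0.5, 0.2]
ete, ety, rest = residual_correlations(R, r_y, s=0)
```

The off-diagonal entries of `ete` are the partial correlations of the remaining predictors given the selected one, which is why no raw data are needed at this step.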
3. LAYOUT OF THE SIMULATION EXPERIMENT

In this section we describe the layout of our simulation
study. In designing our simulation experiment we made
use of some results regarding the design of simulation
experiments in regression given in [11].

Generation of the data:

All data sets to be considered in this study will be
generated from a normal distribution with mean 0 and
covariance given by the $(p+1) \times (p+1)$ matrix

$$\Gamma = \begin{pmatrix} \rho_X & \rho_{XY} \\ \rho_{XY}^T & 1 \end{pmatrix}.$$

Several types of correlations $\rho_X$ between the predictors
will be considered. The correlations $\rho_{XY}$ between the
response and the predictors are such that they correspond to
particular specifications of the slopes in the model
$Y = X\beta + \varepsilon$.

Factors to be considered:

Our simulation layout corresponds to a factorial
experiment with the following factors:

1. Number of predictors in the model: We will consider
models with 4 and 8 predictors.

2. Number of predictors with non-zero slope: We will
consider models with 1 and 2 predictors with non-zero
slopes.

3. Sample size: 50 and 100 data points.

4. Type of correlation structure between the predictors:
We will study correlations of the form

$$\rho_X = \begin{pmatrix} A & 0 \\ 0 & I \end{pmatrix},$$

where $I$ is the identity matrix of
order $p - g$ and $A$ is one of the following matrices:

Equi-correlation,

$$\begin{pmatrix} 1 & \rho & \rho & \rho \\ \rho & 1 & \rho & \rho \\ \rho & \rho & 1 & \rho \\ \rho & \rho & \rho & 1 \end{pmatrix}$$

Markovian,

Equi-predict.

5. Values of the parameters: We will set $\sigma^2 = 1$ and the
values of the non-zero slopes will be selected in such a
way that the 0.10 level t-test for testing the hypothesis that
the slope is zero in the model $Y = X_j\beta + \varepsilon$ has an approximate
power of .90 and .99. The corresponding values of .43
and .62 for a sample size of 50, and .31 and .45 for a
sample size of 100 can be obtained from results
reported in [11].

For each of these factor-level combinations a total of
1000 different data sets will be generated. Both our
algorithm and the standard stepwise algorithm as
implemented in S-plus will be run and analyzed.

4. MEASURES OF PERFORMANCE

The performance of the algorithms will be evaluated
using the following measures:

1. Mean Square Error of Prediction: For each model a
prediction data set consisting of 100 data points from
the true model will be generated. The quantity

$$\frac{1}{100} \sum_{j=1}^{100} (\hat{Y}_j - \mu_j)^2$$

will be computed. Here $\hat{Y}_j$ is the predicted response
computed from the selected model, and $\mu_j$ is the true
response at the $j$-th observation.

2. Proportion of times the correct model is selected,

3. Proportion of times each of the predictors with non-zero
slope was included in the final model,

4. The mean number of noise variables selected in the
final model.

In addition, such numerical performance measures as
speed and number of iterations will also be measured.
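
The factorial layout and the summary measures described above can be sketched as follows. This is only an illustration: the function and variable names are hypothetical, the three correlation structures are treated as plain labels (the text may also vary the correlation $\rho$ and the block size $g$, which would add levels), and the mean square error of prediction is taken to be the average squared difference between predicted and true responses.

```python
from itertools import product

# Factor levels quoted in the text; the structure names are labels only.
N_PREDICTORS = (4, 8)
N_NONZERO = (1, 2)
SAMPLE_SIZES = (50, 100)
STRUCTURES = ("equi-correlation", "Markovian", "equi-predict")
# Non-zero slope giving power .90 / .99 for each sample size (values from the text).
SLOPES = {50: {0.90: 0.43, 0.99: 0.62}, 100: {0.90: 0.31, 0.99: 0.45}}

def design_cells():
    """Enumerate the factor-level combinations of the factorial layout."""
    cells = []
    for p, k, n, structure, power in product(
            N_PREDICTORS, N_NONZERO, SAMPLE_SIZES, STRUCTURES, (0.90, 0.99)):
        cells.append({"p": p, "nonzero": k, "n": n, "structure": structure,
                      "power": power, "slope": SLOPES[n][power]})
    return cells

def summarize(selections, true_set, p):
    """Measures 2-4 for one cell, given the selected index set of each replicate."""
    reps = len(selections)
    noise = set(range(p)) - set(true_set)
    prop_correct = sum(set(sel) == set(true_set) for sel in selections) / reps
    inclusion = {j: sum(j in sel for sel in selections) / reps for j in true_set}
    mean_noise = sum(len(set(sel) & noise) for sel in selections) / reps
    return prop_correct, inclusion, mean_noise

def mse_prediction(y_hat, mu):
    """Measure 1: average squared gap between predicted and true responses."""
    return sum((yh - m) ** 2 for yh, m in zip(y_hat, mu)) / len(mu)

cells = design_cells()
# Four made-up replicates in which only predictor 0 has a non-zero slope.
demo = summarize([{0}, {0, 2}, {1}, {0}], true_set={0}, p=4)
```

In an actual run, each cell would be replicated 1000 times and `summarize` applied once to the selections of each algorithm.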
Abstract

In this paper, first we use saddlepoint methods
to approximate the density of an estimator
for a p-dimensional parameter $\theta$, that is given
implicitly as the solution of p nonlinear equations
$\sum_{i=1}^{n} \psi_j(X_i, \theta) = 0$, $j = 1, \ldots, p$. Here,
the $X_i$ are i.i.d. random variables with density
$f(x, \theta)$ and $\psi_j$ is a nondecreasing function
satisfying certain mild regularity conditions. The
one-dimensional case was treated by H. Daniels
(Biometrika, 1983).

Then, we find saddlepoint approximations
for the densities of the least squares and the robust
M estimator of regression $\hat{\beta}_M$. For the linear
regression model $y_i = x_i^T \beta + e_i$, $\hat{\beta}_M$ satisfies
the system of equations $\sum_{i=1}^{n} x_i \psi(y_i - x_i^T \beta) = 0$,
where $x_i^T$ is a p-dimensional row vector and $\beta$
is a p-dimensional column vector. If $\psi(u) = u$
the least squares estimator is obtained.

1  Introduction 

Let $X_1, X_2, \ldots, X_n$ be i.i.d. random variables
with density function $f(x)$, and moment generating
function $M(t) = \int e^{tx} f(x)\,dx$, which
converges for each real $t$ in the interval $(c_1, c_2)$
that contains zero.

Let $f_n(\bar{x})$ be the density function of the sample
mean $\bar{x} = \sum_{i=1}^{n} X_i / n$. H. Daniels (1954) applied
saddlepoint methods of asymptotic analysis
to find $f_n(\bar{x}) = s_n(\bar{x})[1 + O(1/n)]$ where

$$s_n(\bar{x}) = \left( \frac{n}{2\pi K''(T_0)} \right)^{1/2} \exp\big(n(K(T_0) - T_0 \bar{x})\big) \qquad (1.1)$$

is called the saddlepoint approximation to the
density $f_n(\bar{x})$. Here $K(T) = \log M(T)$ is the
cumulant generating function, and $T_0$ is the
saddlepoint, i.e. $K'(T_0) = \bar{x}$. Since $s_n(\bar{x})$ does not
integrate to 1, sometimes a renormalized
saddlepoint approximation is used.
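
As a small numerical illustration of (1.1), the sketch below evaluates the saddlepoint approximation for the mean of $n$ i.i.d. standard exponential variables, for which $K(T) = -\log(1 - T)$ and the saddlepoint is available in closed form. The exponential example and the function names are assumptions chosen for illustration; they do not come from the paper.

```python
import math

def saddlepoint_density_exp_mean(xbar, n):
    """Saddlepoint approximation (1.1) for the mean of n i.i.d. Exp(1) variables.

    For Exp(1): K(T) = -log(1 - T), K'(T) = 1/(1 - T), K''(T) = 1/(1 - T)**2.
    """
    t0 = 1.0 - 1.0 / xbar              # solves K'(T0) = xbar
    k0 = -math.log(1.0 - t0)           # K(T0)
    k2 = 1.0 / (1.0 - t0) ** 2         # K''(T0)
    return math.sqrt(n / (2.0 * math.pi * k2)) * math.exp(n * (k0 - t0 * xbar))

def exact_density_exp_mean(xbar, n):
    """Exact density of the mean: Gamma with shape n and rate n."""
    log_f = n * math.log(n) + (n - 1) * math.log(xbar) - n * xbar - math.lgamma(n)
    return math.exp(log_f)

n = 5
# Ratio of approximate to exact density at a few points of the support.
ratios = [saddlepoint_density_exp_mean(x, n) / exact_density_exp_mean(x, n)
          for x in (0.3, 1.0, 2.5)]
```

In this example the ratio is the same at every point, so renormalizing the approximation would reproduce the exact density; the constant differs from 1 only through Stirling's approximation to the gamma function.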

The saddlepoint approximation improves the
one given by the two-term Edgeworth expansion
for $f_n(\bar{x})$, which can give negative values for $\bar{x}$
values far away from the mean $\mu$.

The saddlepoint approximation can also be
obtained by using a conjugate family of densities
for $f(x)$ defined by $f(x, \lambda) = \exp(\lambda x - K(\lambda)) f(x)$,
which has mean $\mu_\lambda = K'(\lambda)$ and variance
$\sigma_\lambda^2 = K''(\lambda)$. Notice that

$$f_n(\bar{x}) = f_n(\bar{x}, \lambda) \exp\big(n[K(\lambda) - \lambda \bar{x}]\big) \qquad (1.2)$$

Then, using an Edgeworth expansion for $f_n(\bar{x}, \lambda)$
at its center and letting $\lambda = T_0$, we obtain the
saddlepoint approximation (1.1). This procedure
is called tilted or indirect Edgeworth.

O. Barndorff-Nielsen and D. R. Cox (1979)
extended Daniels' result to multivariate densities.
Let $X_1, \ldots, X_n$ be p-dimensional random
vectors with cumulant generating function
$K(T)$ where $T$ is in $R^p$. Then, the saddlepoint
approximation to the density function $f_n(\bar{x})$ of
the p-dimensional mean $\bar{x}$ is

$$s_n(\bar{x}) = \left( \frac{n}{2\pi} \right)^{p/2} |K''(T_0)|^{-1/2} \exp\big(n(K(T_0) - T_0^T \bar{x})\big) \qquad (1.3)$$

where $T_0$ is the p-dimensional saddlepoint, $T_0^T$
its transpose and $|K''(T_0)|$ is the determinant
of the matrix $K''(T_0)$ of second partial derivatives of $K$ evaluated at $T_0$.

2 Saddlepoint methods for M-estimators


Let $X$ be a random variable with density function
$f(x, \theta)$, where $\theta$ is an unknown parameter.
An M-estimator $\hat{\theta}_n$ of $\theta$ based on a random
sample $X_1, \ldots, X_n$ is obtained by solving with
respect to $t$

$$\sum_{i=1}^{n} \psi(X_i, t) = 0 \qquad (2.1)$$

where $\psi$ is a nondecreasing function.

Notice that for $\psi(x, t) = x - t$ we obtain the
sample mean $\bar{x}$, and for $\psi(x, t) = \partial \log f(x, t)/\partial t$
we obtain the ML estimator of $\theta$. If $\psi(u)$ is
bounded then the estimator is said to be robust;
for instance, for $\psi(u) = \min(k, \max(u, -k))$
we obtain the Huber estimator.
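
To make the definition concrete, the following sketch solves the estimating equation (2.1) for a location parameter, taking $\psi(x, t) = \psi(x - t)$, by bisection. The cutoff $k = 1.5$, the data, and the function names are illustrative assumptions; bisection is simply one convenient solver, not a method prescribed by the paper.

```python
def huber_psi(u, k=1.5):
    """Huber's psi: the identity clipped to [-k, k]."""
    return min(k, max(u, -k))

def m_estimate_location(data, psi, iterations=200):
    """Solve sum(psi(x_i - t)) = 0 in t by bisection.

    Bisection works because psi is nondecreasing, so the sum is
    nonincreasing in t and changes sign between min(data) and max(data).
    """
    lo, hi = min(data), max(data)
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if sum(psi(x - mid) for x in data) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# One wild observation pulls the mean far more than the Huber estimate.
data = [1.0, 2.0, 3.0, 4.0, 100.0]
mean_estimate = m_estimate_location(data, lambda u: u)
huber_estimate = m_estimate_location(data, huber_psi)
```

With $\psi(u) = u$ the solver returns the sample mean of the data, while the bounded $\psi$ limits the influence of the outlying value.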

H. Daniels (1983) found the saddlepoint
approximation to the density $f_n$ of $\hat{\theta}_n$, where $\hat{\theta}_n$
solves the equation (2.1). At the point $\hat{\theta}_n = a$,
$f_n$ is approximated by

$$g_n(a) = \left\{ \frac{n}{2\pi K''(T_0, a)} \right\}^{1/2} \left( -\frac{K^*(T_0, a)}{T_0} \right) e^{nK(T_0, a)} \qquad (2.2)$$

where $K(T, a)$ is the cumulant generating function
of $\psi(X, a)$, here $a$ is a fixed value,
$K'(T_0, a) = 0$, and $K^*$ represents the derivative
of $K$ with respect to $a$.

Now let us treat the multiparametric case.
Here $\theta$ is in $R^p$ and the M-estimator $\hat{\theta}_n$ is the
solution (in $t$) of the system of p nonlinear equations

$$\sum_{i=1}^{n} \psi_j(X_i, t) = 0, \quad j = 1, \ldots, p \qquad (2.3)$$

Let us consider the random vector $\psi(x, a) =
(\psi_1, \ldots, \psi_p)$. From now on consider the nonnegative
integral vector $v = (v_1, \ldots, v_p)$. Further
write $|v| = v_1 + \cdots + v_p$, $v! = v_1! \cdots v_p!$ and
$D^v = (D_1)^{v_1} \cdots (D_p)^{v_p}$ for the $v$-th derivative
with respect to $\theta$.

If the following conditions are satisfied (see
Huber (1981) pg. 132):

A1. $E_a[\psi(x, a)] = 0$.

A2. $E_a[\|\psi(x, a)\|^2] < \infty$ and there exists
an $\varepsilon > 0$ such that
$E_a\big[\max_{\|\theta - a\| < \varepsilon} \|D\psi(x, \theta)\|^2\big] < \infty$.

A3. The matrices $A = (E_a D_r \psi_j(x, a))_{1 \le r, j \le p}$
and $C = COV[\psi(x, a)] = (E_a[\psi_i(x, a)\psi_j(x, a)])$
are nonsingular.

Then $T_n = \sqrt{n}(\hat{\theta}_n - a)$ has a limiting
p-variate normal distribution with mean 0 and
dispersion matrix $A^{-1} C (A^{-1})^T$.

Replacing the condition A2 by

A2'. $E_a[\|D^v \psi_j(x, a)\|^2] < \infty$ for $|v| = 1, 2$,
and there exists an $\varepsilon > 0$ such that
$E_a\big[\max_{\|\theta - a\| < \varepsilon} \|D^v \psi_j(x, \theta)\|^2\big] < \infty$
if $|v| = 3$, for $j = 1, \ldots, p$,

the two-term Edgeworth expansion for the
distribution function of $\sqrt{n}(\hat{\theta}_n - a)$ can be
obtained from Theorem 3 of Bhattacharya and
Ghosh (1978, page 440). Thus

$$P(\sqrt{n}(\hat{\theta}_n - a) \in B) = \int_B [1 + n^{-1/2} P_1(x)]\, \phi_M(x)\, dx + o(n^{-1/2}) \qquad (2.4)$$

uniformly in $B$ that belongs to the Borel system
of $R^p$. Here $\phi_M$ stands for the p-variate normal
density with mean 0 and covariance matrix
$M = A^{-1} C (A^{-1})^T$. Also $P_1(x)$ is a polynomial
not depending on $n$ whose coefficients are themselves
polynomials in the moments of order 3 or
less of $\psi(x, a)$.

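Let us consider the conjugate family of densities
for $f(x)$ given by

$$f(x, \lambda) = \exp\big(\lambda^T \psi(x, a) - K(\lambda, a)\big) f(x)$$

where $K(\lambda, a)$ is the cumulant generating function
of the random vector $\psi(x, a)$. Notice that
$E_\lambda[\psi(x, a)] = K'(\lambda, a)$ and $COV_\lambda[\psi(x, a)] =
K''(\lambda, a)$. It is easy to prove that

$$f_{\hat{\theta}}(a) = e^{nK(\lambda, a)} f_{\hat{\theta}}(a, \lambda) \qquad (2.5)$$

Notice that $f_{\hat{\theta}}(w, \lambda) = n^{p/2} f_{T_n}(\sqrt{n}(w - a), \lambda)$.
Therefore $f_{\hat{\theta}}(a, \lambda) = n^{p/2} f_{T_n}(0, \lambda)$. Choosing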
$\lambda = T_0$ such that $K'(T_0, a) = 0$, it follows from
(2.4) that

$$f_{T_n}(0, T_0) \approx \frac{|A|}{(2\pi)^{p/2} |C|^{1/2}} \qquad (2.6)$$

where $|A|$ is the determinant of the matrix $A$
which is computed under the conjugate distribution,
and $C = K''(T_0, a)$ where $T_0$ denotes
the p-dimensional saddlepoint. Finally it turns
out that the saddlepoint approximation to the
density function of $\hat{\theta}_n$ at the point $a$ is

$$g_n(a) = \left( \frac{n}{2\pi} \right)^{p/2} |A|\, |C|^{-1/2} e^{nK(T_0, a)} \qquad (2.7)$$

Usually the saddlepoint has to be computed
numerically over a grid of values $a$. An equivalent
result to (2.7) has been obtained by Field
(1982), who also shows some numerical examples.

Example 1. Location and scale estimation in
a normal density

Let us consider a Normal random variable $X$
with mean $\mu$ and standard deviation $\sigma$, both
of them unknown. Let $\mathbf{a} = (a, b)$ where $a$ and
$b$ are the least squares estimators of $\mu$ and $\sigma$
respectively. In this case $\psi_1(x, \mathbf{a}) = (x - a)/b$ and
$\psi_2(x, \mathbf{a}) = ((x - a)/b)^2 - 1$.

After long computations we obtain

$$K(T, \mathbf{a}) = -t_2 + \frac{(\mu - a)^2}{b^2} t_2 + \frac{\mu - a}{b} t_1 + \frac{\sigma^2 (2(\mu - a) t_2 + b t_1)^2}{2 b^2 (b^2 - 2\sigma^2 t_2)} + \frac{1}{2} \log\left( \frac{b^2}{b^2 - 2\sigma^2 t_2} \right)$$

The saddlepoint is $T_0 = \left( -\frac{(\mu - a) b}{\sigma^2},\ \frac{b^2 - \sigma^2}{2\sigma^2} \right)$. Also

$$K(T_0, \mathbf{a}) = \frac{1}{2} - \frac{b^2}{2\sigma^2} - \frac{(\mu - a)^2}{2\sigma^2} + \log\left( \frac{b}{\sigma} \right),$$

$|K''(T_0, \mathbf{a})| = 2$ and $|A| = 2/b^2$. Then, the saddlepoint
approximation is given by

$$g_n(\mathbf{a}) = \frac{n}{\pi \sqrt{2}\, b^2} \left( \frac{b}{\sigma} \right)^n \exp\left( \frac{n}{2} - \frac{n b^2}{2\sigma^2} - \frac{n (a - \mu)^2}{2\sigma^2} \right)$$

which results to be exact except for the constant.

3 Saddlepoint approximations for estimators of regression

Let us consider the multiple regression model

$$y_i = x_i^T \beta + e_i, \quad i = 1, \ldots, n \qquad (3.1)$$

where $e_1, \ldots, e_n$ are i.i.d. random variables with
common distribution $F$; $x_1^T, \ldots, x_n^T$ are known
nonrandom p-dimensional row vectors and $\beta$ is
the $p \times 1$ vector of unknown parameters. We
will use also the following notation:

$X = (x_1, \ldots, x_n)'$ represents a design matrix
and $X'$ its transpose. Notice that $X'X = \sum_{i=1}^{n} x_i x_i^T$.

$H = X(X'X)^{-1}X'$ is a projection matrix
with diagonal element $h_{ii}$.

Next we will discuss the saddlepoint approximation
for the density of the least squares estimator
of regression, which is based on the fact
that $\hat{\beta}$ can be expressed as a linear combination
of the $e_i$'s. Later we will treat the case of the
M-estimator of regression.

3.1 Saddlepoint approximation in least squares regression

The least squares estimator $\hat{\beta}$ of $\beta$ in the regression
model (3.1) is given by $\hat{\beta} = (X'X)^{-1} X'Y$.

Huber (1981, pg. 159) proved that under
the following conditions

B1. The $e_i$'s are i.i.d. with mean 0 and finite variance $\sigma^2$,

B2. $X$ has full rank $p$,

B3. $\max_{1 \le i \le n} h_{ii} \to 0$,

$T_n = (X'X)^{1/2}(\hat{\beta} - \beta)$ has a limiting
p-variate normal distribution with mean 0 and
dispersion matrix $\sigma^2 I_p$, where $I_p$ is the identity
matrix of order $p$.

Under the conditions B1, B2 and the ones
given below

B3'. The $e_i$'s have finite $s$-th absolute moment,
for some integer $s \ge 3$ and $\lim_n \ldots \|x_i\|^s < \infty$.

B4'. $\lim \lambda_n / n > 0$ and $M_n = O(n^\delta)$ for
some $\delta \in [0, 1/2)$. Here $\lambda_n$ = the smallest eigenvalue
of $X'X$ and $M_n = \max_{1 \le i \le n} \|x_i\|$.
Qumsiyeh (1990) obtained the following two-term
Edgeworth expansion for the density function
of $T_n$

$$f_{T_n}(x) - \big[1 + n^{-1/2} P_1(-D, \{\bar{\chi}_v\})\big] \phi_{\sigma^2}(x) = O(n^{-1}) \qquad (3.2)$$

uniformly in $x \in R^p$. Here $\phi_{\sigma^2}$ represents the
p-variate normal density with mean 0 and covariance
matrix $\sigma^2 I_p$. Also

$$P_1(-D, \{\bar{\chi}_v\}) \phi_{\sigma^2} = -\sum_{|v| = 3} \frac{\bar{\chi}_v}{v!} D^v \phi_{\sigma^2}(x) \quad \text{and} \quad \bar{\chi}_v = \frac{1}{n} \sum_{i=1}^{n} \chi_v(Z_i),$$

where $\chi_v(Z_i)$ denotes the $v$-th cumulant of $Z_i =
n^{1/2} (X'X)^{-1/2} x_i e_i$, which are independent with
mean 0.

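Now let us derive the saddlepoint approximation
to the density of $T_n$. Let $T_n = \sum_{i=1}^{n} d_i e_i$,
being $d_i = (X'X)^{-1/2} x_i$ a $p \times 1$ vector. Then
$K_{T_n}(t) = \sum_{i=1}^{n} K_{e_i}(d_i^T t)$, where $K_{e_i}(\cdot)$ stands
for the cumulant generating function of the error
$e_i$. Also $K'_{T_n}(t) = \sum_{i=1}^{n} d_i K'_{e_i}(d_i^T t)$ and
$K''_{T_n}(t) = \sum_{i=1}^{n} d_i K''_{e_i}(d_i^T t) d_i^T$. On the other hand

$$f_{T_n}(a) = e^{K_{T_n}(\lambda) - \lambda^T a} f_{T_n}(a, \lambda) \qquad (3.3)$$

Notice that $E_\lambda[T_n] = K'_{T_n}(\lambda)$. Also $COV_\lambda[T_n] =
K''_{T_n}(\lambda)$. Choosing $\lambda = t_0$ such that $K'_{T_n}(t_0) = a$
then from (3.2) and (3.3) the saddlepoint approximation
to the density of $T_n$ at the point $a$
is as follows

$$g_n(a) = (2\pi)^{-p/2} |K''_{T_n}(t_0)|^{-1/2} \exp\big(K_{T_n}(t_0) - t_0^T a\big) \qquad (3.4)$$

Example 2. Normally distributed errors

In this case $K_{e_i}(t) = \sigma^2 t^2 / 2$ and
$K_{T_n}(t) = \sigma^2 t^T t / 2$. Since $\sum_{i=1}^{n} d_i d_i^T = I$, then $t_0 = a/\sigma^2$,
also $K''_{T_n}(t_0) = \sigma^2 I_p$ and $K_{T_n}(t_0) - t_0^T a = -a^T a/(2\sigma^2)$,
yielding the saddlepoint approximation

$$g_n(a) = (2\pi\sigma^2)^{-p/2} \exp\left( -\frac{a^T a}{2\sigma^2} \right)$$

which results to be exact.

3.2 Saddlepoint approximation for M regression

Let $\psi$ be a nondecreasing and bounded real-valued
function, then an M-estimator $\hat{\beta}_M$ of $\beta$ corresponding
to $\psi$ is defined as the solution (in $t$)
of the vector equation

$$\sum_{i=1}^{n} x_i \psi(y_i - x_i^T t) = 0$$

It is well known (Huber, 1981, pg. 165) that
under the following conditions on the error distribution
$F$, $\psi$ and the design matrix $X$:

C1. $\psi$ is twice differentiable and the second
derivative $\psi''$ satisfies a Lipschitz condition of
order $\omega$ for some $0 < 2\omega < 1$.

C2. $E(\psi(e_1)) = 0$, and $\tau^2 = E\psi^2(e_1)/(E\psi'(e_1))^2 \in (0, \infty)$.

C3. $\sum_{i=1}^{n} x_i x_i^T$ is invertible for some $n \ge p$.

Then $T_n = (\sum_{i=1}^{n} x_i x_i^T)^{1/2} (\hat{\beta}_M - \beta)$ has a
limiting p-variate normal distribution with mean
0 and dispersion matrix $\tau^2 I_p$, where $I_p$ denotes
the identity matrix of order $p$.

Write $q = p(p+1)/2$ and for each $d_i =
(d_{i1}, \ldots, d_{ip})^T$ define the $q \times 1$ vector

$$c_i^T = (d_{i1}^2, d_{i1} d_{i2}, \ldots, d_{i1} d_{ip}, d_{i2}^2, d_{i2} d_{i3}, \ldots, d_{i2} d_{ip}, \ldots, d_{ip}^2).$$

The spectral decomposition of the real symmetric
matrix $\sum_{i=1}^{n} c_i c_i^T$, of rank $r$, yields a $q \times q$ nonsingular
matrix $B$ such that

$$B \left( \sum_{i=1}^{n} c_i c_i^T \right) B' = \begin{pmatrix} I_r & 0 \\ 0 & 0 \end{pmatrix}$$

Let $B' = [B_1' \mid B_2']$ where $B_1$ is of order $r \times q$.
Define the column vector $b_i$ by $b_i = B_1 c_i$ for
$1 \le i \le n$.

Let $\gamma_n = (\sum_i \ldots) + (\sum_i \ldots)$.

For $\delta > 0$, define $A_n(\delta) = \{i : 1 \le i \le
n,\ (d_i^T t_1)^2 + (b_i^T t_2)^2 > \ldots$ for all $t_1 \in R^p$ and
$t_2 \in R^r$ with $\|t_1\|^2 + \|t_2\|^2 = 1\}$.

Consider the following two additional conditions:

C4. $\gamma_n = o(1)$.

C5. There exists $\delta > 0$ such that $\ldots =
O(1)$ where $K(\delta) = \#[A_n(\delta)]$.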
Under conditions C1-C5, Lahiri (1992b) has
found the following two-term Edgeworth expansion
for the distribution function of $T_n$

$$P(T_n \in B) = \int_B (1 + P_1(F, x))\, \phi_\tau(x)\, dx + o(\gamma_n) \qquad (3.5)$$

uniformly in $B$ that belongs to the Borel system
of $R^p$. Here $\phi_\tau$ stands for the p-variate
normal density with mean 0 and covariance matrix $\tau^2 I_p$;
$P_1(F, x)$ is a polynomial, whose coefficients are
continuous functions of the finite moments of
$\psi(e_1)$, $\psi'(e_1)$ and $\psi''(e_1)$.

Now let us obtain an approximate saddlepoint
approximation for the density function of
$T_n$.

Using equation (3.12) from Lahiri's paper
(1992b, pg. 1560) we can write

$$T_n = a^{-1} \sum_{i=1}^{n} d_i \psi(e_i) + R_{1n} \qquad (3.6)$$

where $a = E[\psi'(e_i)]$ and $R_{1n}$ is a remainder.

Using (3.6) we can approximate the cumulant
generating function of $T_n$ as is suggested by
Easton and Ronchetti (1986). Thus
$K_{T_n}(t) \approx \sum_{i=1}^{n} K_{\psi(e_i)}(d_i^T t / a)$, where $K_{\psi(e_i)}(\cdot)$ stands for
the cumulant generating function of $\psi(e_i)$.
Also $K'_{T_n}(t) \approx a^{-1} \sum_{i=1}^{n} d_i K'_{\psi(e_i)}(d_i^T t / a)$ and
$K''_{T_n}(t) \approx a^{-2} \sum_{i=1}^{n} d_i K''_{\psi(e_i)}(d_i^T t / a) d_i^T$.

Evaluating the above expressions at the saddlepoint
$t_0$ and replacing them in (3.4) we obtain
an approximate saddlepoint approximation
for the density of $T_n$.
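
As a numerical illustration of the least squares case, the sketch below evaluates the saddlepoint density of $T_n$ for a single predictor ($p = 1$), taking (3.4) to have the usual form $g_n(a) = (2\pi)^{-p/2} |K''_{T_n}(t_0)|^{-1/2} \exp(K_{T_n}(t_0) - t_0^T a)$. The function name, the no-intercept design, the Newton solver for the saddlepoint, and the numerical values are all assumptions made for the example. With normal errors the result should agree with the exact $N(0, \sigma^2)$ density, as in Example 2.

```python
import math

def saddlepoint_ls_density(a, x, K, K1, K2, newton_steps=50):
    """Saddlepoint density of T_n at a for a one-predictor regression (p = 1).

    x          : design values x_1..x_n (no intercept, an illustrative choice)
    K, K1, K2  : cumulant generating function of the error and its first
                 two derivatives
    """
    norm = math.sqrt(sum(xi * xi for xi in x))
    d = [xi / norm for xi in x]                      # d_i = (X'X)^(-1/2) x_i
    K_T = lambda t: sum(K(di * t) for di in d)
    K_T1 = lambda t: sum(di * K1(di * t) for di in d)
    K_T2 = lambda t: sum(di * di * K2(di * t) for di in d)
    t0 = 0.0
    for _ in range(newton_steps):                    # solve K_T'(t0) = a
        t0 -= (K_T1(t0) - a) / K_T2(t0)
    return math.exp(K_T(t0) - t0 * a) / math.sqrt(2.0 * math.pi * K_T2(t0))

# Normal errors with variance sigma2: K_e(t) = sigma2 * t^2 / 2.
sigma2 = 4.0
x = [1.0, 2.0, 3.0, 4.0, 5.0]
approx = saddlepoint_ls_density(
    1.3, x,
    K=lambda t: sigma2 * t * t / 2.0,
    K1=lambda t: sigma2 * t,
    K2=lambda t: sigma2)
exact = math.exp(-1.3 ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)
```

For non-normal errors the same function can be called with the corresponding cumulant generating function, provided it exists in a neighbourhood of the saddlepoint.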
Abstract. The cumulative sum (CUSUM) technique is widely used in industrial quality control to detect a small
change  in  the  mean.  A  shortcoming  of  the  CUSUM  technique  is  that  it  is  not  robust,  namely,  very  sensitive  to  a 
few  wild  observations.  A  nonparametric  version  of  the  CUSUM  method  based  on  the  rank  statistics  and  its 
standardization  is  proposed  which  is  more  robust  than  the  original  CUSUM.  Two  examples  are  used  to  illustrate 
the  proposed  test. 

1. Introduction

Many production processes subject to external stimulants may result in a change such that the mean of the process deviates from the specified target value. It is important that such a deviation be detected as early as possible. The CUSUM technique proposed by Page [3-5] is more powerful for detecting a small change in the mean level of a continuous process than the conventional Shewhart control chart. Therefore, it is widely used in the area of quality control. See, for example, Bissell [1] for a review. However, a shortcoming of the CUSUM method is that it is not robust; namely, it is very sensitive to a few wild observations.

To overcome this shortcoming, a nonparametric version of the CUSUM method and its standardization based on rank statistics are proposed. Although other nonparametric tests were considered before (McGilchrist-Woodyer [2], Pettitt [6], Wolfe-Schechtman [9]), the one proposed here has the advantages that it is easier to implement computationally and that the slope change between successive points can be visualized graphically in the sequential plot of the rank cusum, as in the original CUSUM method.

2. Rank CUSUM Test

Let $\{X_i\}$, $i = 1, \ldots, n$, with n being given, be a sequence of independent, continuous random variables such that $X_j$, $j = 1, \ldots, k$, has a probability distribution $F(x)$, and $X_j$, $j = k+1, \ldots, n$, has a probability distribution $F(x - \delta)$, where both k and $\delta$ are unknown with $2 \le k \le n-1$ and $-\infty < \delta < \infty$. The integer k is called the change-point and $\delta$ the magnitude of change. We consider the problem of testing the null hypothesis of no change, $H_0: \delta = 0$, against the alternative of change, $H_1: \delta > 0$ (or $\delta < 0$, or $\delta \ne 0$).

Let $R_j$ be the rank of $X_j$ in the ordered sequence $X_{(1)} < X_{(2)} < \ldots < X_{(n)}$. For testing $H_0: \delta = 0$ against $H_1: \delta > 0$ (or $\delta < 0$), the rank version of the CUSUM and its standardization are defined, respectively, by

$$U_1^- = -\min_{2 \le q \le n-1} \{U_{1,q}\} \qquad (1)$$

or

$$U_2^- = -\min_{2 \le q \le n-1} \{U_{2,q}\} , \qquad (2)$$

where $U_{1,q}$ and $U_{2,q}$ are given by

$$U_{1,q} = \sum_{i=1}^{q} \left( R_i - \frac{n+1}{2} \right) \qquad (3)$$

and

$$U_{2,q} = \frac{U_{1,q}}{\left( q(n-q)(n+1)/12 \right)^{1/2}} . \qquad (4)$$

Reject $H_0$ for large values of $U_1^- = -\min_{2 \le q \le n-1}\{U_{1,q}\}$ or $U_2^- = -\min_{2 \le q \le n-1}\{U_{2,q}\}$ (or reject $H_0$ for large values of $U_1^+ = \max_{2 \le q \le n-1}\{U_{1,q}\}$ or $U_2^+ = \max_{2 \le q \le n-1}\{U_{2,q}\}$).

For the two-sided alternative $H_1: \delta \ne 0$, reject $H_0$ for large values of $U_1^* = \max_{2 \le q \le n-1}\{|U_{1,q}|\} = \max\{U_1^-, U_1^+\}$ or $U_2^* = \max_{2 \le q \le n-1}\{|U_{2,q}|\} = \max\{U_2^-, U_2^+\}$. Also, an estimate of the unknown change-point is given by $\hat{k}$, which satisfies

$$|U_{2,\hat{k}}| = \max_{2 \le q \le n-1} \{|U_{2,q}|\} . \qquad (5)$$

Upon closer examination, both $U_{1,q}$ and $U_{2,q}$ are seen to be of the type of the Wilcoxon test. Due to the inherent nature of the Mann-Whitney-Wilcoxon statistics, $U_1^*$ and $U_2^*$ can be shown to be equivalent to the statistics $K_T$ of Pettitt [6] and V of Schechtman [7]. Although they are equivalent, the statistics $U_1^*$ and $U_2^*$ are more convenient to compute than $K_T$ and V because they only require ranking n observations. In contrast, both $K_T$ and V, based upon the Mann-Whitney counting form, require the computation of q(n-q) differences, which can become unmanageable even for moderate values of q and n - q.

Also, note that the correct starting value of the index q should be 2 rather than 1 as used in both Pettitt [6] and Schechtman [7]. The justification is that neither k = n nor k = 1, due to symmetry, can be regarded as a change-point. Another reason is that at least two points are needed to estimate the slope in the rank cusum plot as given in Section 3.

Small sample null distributions of $U_2^*$ can be obtained by evaluating the test statistic $U_2^*$ for all possible arrangements of the ranks for a sample of n observations. A C program has been designed for the personal computer to generate the exact null distribution for any sample size. However, owing to the limited computing speed and memory of the personal computer, only the critical values of the null distribution for sample sizes n = 6, 7, 8, 9, 10, and 11 are given in Table 1.
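
The enumeration described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the author's C program: the function names are hypothetical, the statistic is taken to be $U_2^*$ with q running from 2 to n-1 as in Section 2, and brute-force enumeration of all n! rank arrangements is only practical for small n.

```python
import itertools
import math


def u2_star(ranks):
    """Two-sided standardized rank cusum: max over q = 2..n-1 of |U_{2,q}|."""
    n = len(ranks)
    total = 0.0
    best = 0.0
    for q in range(1, n):                      # q = 1, ..., n-1
        total += ranks[q - 1] - (n + 1) / 2.0  # running value of U_{1,q}
        if q >= 2:
            scale = math.sqrt(q * (n - q) * (n + 1) / 12.0)
            best = max(best, abs(total) / scale)
    return best


def exact_null_values(n):
    """U2* for every arrangement of the ranks 1..n (n! values, so keep n small)."""
    return sorted(u2_star(p) for p in itertools.permutations(range(1, n + 1)))


def tail_probability(values, c):
    """Exact P(U2* >= c) under the null hypothesis of no change."""
    return sum(v >= c - 1e-9 for v in values) / len(values)


if __name__ == "__main__":
    values = exact_null_values(6)
    largest = max(values)
    print(round(largest, 2), tail_probability(values, largest))
```

For n = 6 the largest attainable value and its exact tail probability can be compared with the first row of Table 1.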

Table 1  Critical values and exact significance levels of $U_2^*$

         Nominal a=.10       Nominal a=.05       Nominal a=.01
  n      value   level       value   level       value   level
  6      1.96    .10
  7      2.12    .10
  8      2.24    .08         2.31    .03
  9      2.20    .08         2.45    .03
 10      2.17    .10         2.40    .03         2.61    .008
 11      2.25    .09         2.46    .03         2.74    .008

3. Application

In practical applications, the interest is often in estimating the unknown change-point k if such a change has occurred. Just as with the original CUSUM, the emphasis is on plotting the rank cusum $U_{1,q}$ or $U_{2,q}$ against q; the change-point can be seen vividly at the point where the slope between successive points changes dramatically. Two examples are given to illustrate the use of the rank cusum test. A C program written to implement the rank cusum plot is available upon request from the author.

Example 1. The data given in the second row of Table 2 are taken from Pettitt [6]. They are industrial data representing the percentage of a particular material in 27 batches taken from a given source.

To demonstrate the lack of robustness of the CUSUM method, the cusum calculated from the formula $S_q = \sum_{i=1}^{q} (X_i - \bar{X})$, where $\bar{X}$ is the sample mean of all 27 observations, is given in the sixth row of Table 2. The third, fourth and fifth rows of Table 2 represent the values of $R_q$, $U_{1,q}$ and $U_{2,q}$, respectively. As can be seen from Fig. 1(a), the "spike" value of $S_q$ occurs at q = 7, which reflects the undue effect of the wild observation $X_8$ = 17.7, and certainly is not a satisfactory estimate of the change-point. To an experienced analyst, the cusum plot of $S_q$ is highly varied before q = 16 and much less varied after q = 16; hence the mean has probably changed at q = 16. Both $U_1^*$ and $U_2^*$ give the estimate of the change-point $\hat{k}$ = 16 since the slope has changed dramatically there (Fig. 1(b)-(c)).
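
The computation behind Example 1 can be sketched as follows. This is an illustrative Python sketch, not the author's C program: the function names are hypothetical, and tied observations are assumed to receive midranks, which is consistent with the half-integer ranks listed in Table 2.

```python
import math


def midranks(xs):
    """Ranks 1..n; tied values share the average of the positions they occupy."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        average = (i + j) / 2.0 + 1.0
        for m in range(i, j + 1):
            ranks[order[m]] = average
        i = j + 1
    return ranks


def rank_cusum(xs):
    """Return the lists U_{1,q} and U_{2,q} for q = 1..n (U_{2,n} is undefined)."""
    n = len(xs)
    ranks = midranks(xs)
    u1, u2 = [], []
    total = 0.0
    for q in range(1, n + 1):
        total += ranks[q - 1] - (n + 1) / 2.0
        u1.append(total)
        if q < n:
            u2.append(total / math.sqrt(q * (n - q) * (n + 1) / 12.0))
        else:
            u2.append(float("nan"))
    return u1, u2


def change_point(u2):
    """Index q in 2..n-1 that maximizes |U_{2,q}|, as in equation (5)."""
    n = len(u2)
    return max(range(2, n), key=lambda q: abs(u2[q - 1]))


# Data from the second row of Table 2.
X = [7.1, 8.1, 8.2, 11.1, 6.6, 4.9, 4.0, 17.7, 6.5, 4.6, 8.8, 11.6, 6.8, 7.5,
     6.9, 8.1, 9.3, 7.5, 10, 8.7, 9.1, 8.9, 9.1, 9.6, 8.1, 9.8, 8.2]

if __name__ == "__main__":
    u1, u2 = rank_cusum(X)
    print(change_point(u2), u1[15], round(u2[15], 2))
```

Because only the ranks enter the statistic, the wild observation 17.7 contributes no more than any other largest value would.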


Table 2  The values of $X_q$, $R_q$, $U_{1,q}$, $U_{2,q}$, and $S_q$.

q          1        2        3        4        5
X_q        7.1      8.1      8.2      11.1     6.6
R_q        8        12       14.5     25       5
U_{1,q}    -6       -8       -7.5     3.5      -5.5
U_{2,q}    -0.07    -0.74    -0.58    0.24     -0.34
S_q        -1.33    -1.66    -1.89    0.78     -1.05

q          6        7        8        9        10
X_q        4.9      4.0      17.7     6.5      4.6
R_q        3        1        27       4        2
U_{1,q}    -16.5    -29.5    -16.5    -26.5    -38.5
U_{2,q}    -0.96    -1.63    -0.88    -1.36    -1.93
S_q        -5.58    -9.01    0.26     -1.67    -5.7

q          11       12       13       14       15
X_q        8.8      11.6     6.8      7.5      6.9
R_q        17       26       6        9.5      7
U_{1,q}    -35.5    -23.5    -31.5    -36      -43
U_{2,q}    -1.75    -1.15    -1.53    -1.75    -2.1
S_q        -5.33    -2.16    -3.79    -4.7     -6.24

q          16       17       18       19       20
X_q        8.1      9.3      7.5      10       8.7
R_q        12       21       9.5      24       16
U_{1,q}    -45      -38      -42.5    -32.5    -30.5
U_{2,q}    -2.22    -1.91    -2.19    -1.73    -1.69
S_q        -6.57    -5.7     -5.63    -4.06    -3.97

q          21       22       23       24       25
X_q        9.1      8.9      9.1      9.6      8.1
R_q        19.5     18       19.5     22       12
U_{1,q}    -25      -21      -15.5    -7.5     -9.5
U_{2,q}    -1.46    -1.31    -1.06    -0.58    -0.88
S_q        -3.12    -2.65    -1.98    -0.81    -1.14

q          26       27
X_q        9.8      8.2
R_q        23       14.5
U_{1,q}    -0.5     0
U_{2,q}    -0.06    undefined
S_q        0.23     0

Fig. 1  The plot of cusum and its rank counterparts


Example 2. The data used here are taken from Sen-Srivastava [8], namely, the Illinois traffic data. After applying the rank cusum test (Fig. 2), the estimated change-points of deaths and injuries are the same, i.e., $\hat{k}$ = 1965 with $U_2^-$ = 2.56, while $\hat{k}$ = 1966 ($U_2^-$ = 2.61) and $\hat{k}$ = 1967 ($U_2^+$ = 2.45) are the change-point estimates for the accident and death rate data, respectively. Clearly, the mean level of the deaths, injuries and accidents increased. Only the mean level of the death rate decreased. Also, note that all changes are significant at the 0.05 level (Table 1 with n = 10).


Fig. 2  The rank cusum plot of the Illinois traffic data


4. Concluding Remarks

In this paper we have presented two nonparametric tests which are the rank versions of the traditional CUSUM technique. Through an example, the rank cusum test is demonstrated to be more robust than the traditional cusum method. In practice, the standardized rank cusum $U_2^*$ is recommended over the ordinary rank cusum $U_1^*$ because the finite sample null distribution of $U_2^*$ is already constructed, but not that of $U_1^*$. In addition, the proposed test appears not to be limited to the change-point problem of having at most one change. If more than one change-point is visible in the rank cusum plot, all we have to do is split the data set into two subsets, using the first change-point estimate as a dividing point, and then apply the rank cusum test to each of the two subsets.

Evidently, more work still needs to be done. For example, what is the sampling distribution of the change-point estimate $\hat{k}$? Without it, confidence intervals and bounds for the change-point k cannot be calculated. In practical applications, most data collected in time order tend to be correlated. The question then arises: how robust is the rank cusum test when applied to time series data?


Acknowledgements

The computation of Table 1, assisted by Mr. Lei Zhang, a graduate assistant in the Department of Mathematics of Western Illinois University, and the plot of Fig. 1 by Mr. Pete LaPorta of Armstrong Laboratory are gratefully acknowledged.


References

1. Bissell, A.F. (1969). Cusum techniques for quality control. Appl. Statist., 18, 1-30.

2. McGilchrist, C.A. and Woodyer, K.D. (1975). Note on a distribution-free cusum technique. Technometrics, 17, 321-325.

3. Page, E.S. (1954). Continuous inspection schemes. Biometrika, 41, 100-114.

4. Page, E.S. (1955). A test for a change in a parameter occurring at an unknown point. Biometrika, 42, 523-527.

5. Page, E.S. (1957). On problems in which a change in parameter occurs at an unknown point. Biometrika, 44, 248-252.

6. Pettitt, A.N. (1979). A non-parametric approach to the change-point problem. Appl. Statist., 28, 126-135.

7. Schechtman, E. (1982). A non-parametric test for detecting changes in location. Comm. Statist., A11, 1475-1482.

8. Sen, A. and Srivastava, M.S. (1975). Some one-sided tests for changes in level. Technometrics, 17, 61-64.

9. Wolfe, D.A. and Schechtman, E. (1984). Nonparametric statistical procedures for the changepoint problem. J. of Statist. Plan. and Infer., 9, 389-396.


Abstract


This paper examines errors in the estimated solution vector $\hat{x}$ to the linear regression problem

$$y = Kx^* + \epsilon , \quad E(\epsilon) = 0 , \quad E(\epsilon\epsilon^T) = S^2 ,$$

when the dominant uncertainties are the measuring errors $\epsilon$. Backward error analysis gives the hopelessly pessimistic bound

$$\frac{\|\hat{x} - x^*\|_2}{\|x^*\|_2} \le \mathrm{cond}(S^{-1}K)\, \frac{\|S^{-1}\epsilon\|_2}{\|S^{-1}Kx^*\|_2}$$

by assuming the worst possible combination of random errors, an extremely unlikely occurrence for nontrivial problems. A statistical treatment yields a more realistic bound on the expected uncertainty in a single element $\hat{x}_i$ which does not depend on $\mathrm{cond}(S^{-1}K)$. Classical regression theory provides easily computable confidence intervals for the individual $\hat{x}_i$ separately.


Notation  and  Test  Problem 

Statisticians write the m x n linear regression model as

$$Y = X\beta + \epsilon , \quad E(\epsilon) = 0 , \quad E(\epsilon\epsilon^T) = S^2 , \qquad (1)$$

where Y is a measured m-vector containing measuring errors $\epsilon$, X is a known m x n matrix with m ≥ n = rank(X), and $\beta$ is the vector to be estimated. Numerical analysts write the linear least squares problem as

$$\rho_{LS} = \min_x \|b - Ax\|_2^2 , \qquad (2)$$

where b is the measured vector, A is the m x n matrix, x is the vector to be estimated, $\|b - Ax\|_2^2$ is the squared two-norm of the residual vector, and $\rho_{LS}$ is the minimum sum of squared residuals. They usually assume (but seldom state) the linear regression model

$$b = Ax^* + \delta b , \quad E(\delta b) = 0 , \quad E(\delta b\, \delta b^T) = \sigma^2 I_m , \qquad (3)$$

where $I_m$ is the mth order identity matrix, and the scalar $\sigma$ is unknown.

Since choosing either of the above notations would deeply offend one of the two schools, consider

$$y = Kx^* + \epsilon , \quad E(\epsilon) = 0 , \quad E(\epsilon\epsilon^T) = S^2 , \qquad (4)$$

where y is the measured m-vector, and K is the known m x n matrix with rank(K) = n. This notation is appropriate when linear regression is applied to systems of integral equations of the form

$$y_i = \int K_i(\xi)\, x(\xi)\, d\xi + \epsilon_i , \quad i = 1, 2, \ldots, m , \qquad (5)$$

where the $y_i$ are measured values, the $K_i(\xi)$ are known functions, and $x(\xi)$ is the function to be estimated. Such equations are widely used to model the effects of a measuring instrument on the thing being measured. One way to approximate $x(\xi)$ is to replace the integrals with quadrature sums, i.e.,

$$\int K_i(\xi)\, x(\xi)\, d\xi \approx \sum_{j=1}^{n} w_j K_i(\xi_j)\, x(\xi_j) , \qquad (6)$$

where the $w_j$ are prescribed quadrature coefficients and the $x(\xi_j)$ form a discrete approximation to $x(\xi)$. It is important to choose n large enough so that the quadrature errors are small relative to the $\epsilon_i$. If the sums are substituted for the integrals in (5) and the products $w_j K_i(\xi_j)$ collected into a matrix K, the result is the model (4).

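A test problem capturing many of the salient features of real instrument correction problems is obtained by discretizing the Phillips [5] equation

$$y(t) = \int_{-3}^{3} K(t, \xi)\, x(\xi)\, d\xi , \quad -6 \le t \le 6 , \qquad (7)$$

with

$$K(t, \xi) = \begin{cases} 1 + \cos\left[\frac{\pi(\xi - t)}{3}\right] , & |\xi - t| \le 3 ,\ |t| \le 6 \\ 0 , & \text{otherwise} , \end{cases} \qquad (8)$$

and

$$y(t) = \begin{cases} (6 - |t|)\left[1 + \frac{1}{2}\cos\left(\frac{\pi t}{3}\right)\right] + \frac{9}{2\pi}\sin\left(\frac{\pi |t|}{3}\right) , & |t| \le 6 \\ 0 , & \text{otherwise} . \end{cases} \qquad (9)$$

The kernel $K(t, \xi)$ is non-negative, with maximum value 2, attained on the line $t = \xi$. The solution is

$$x(\xi) = \begin{cases} 1 + \cos\left(\frac{\pi \xi}{3}\right) , & |\xi| \le 3 \\ 0 , & \text{otherwise} . \end{cases} \qquad (10)$$

The functions $y(t)$ and $x(\xi)$ are plotted in Figure 1.

Figure 1: $x(\xi)$ and $y(t)$ for Phillips Problem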

Discretizing replaces the continuous variables t and $\xi$ with meshes $t_i$, $i = 1, \ldots, m$ and $\xi_j$, $j = 1, \ldots, n$. Choosing m = 150 equi-spaced $t_i$ on $-5.925 \le t \le 5.925$ and using an n = 121 point trapezoidal rule on $-3.0 \le \xi \le 3.0$ gave

$$y^* = Kx^* , \qquad (11)$$

where $x^*$ is a 121-vector of $x(\xi_j)$ computed by (10), and $y^*$ was computed by (11) rather than (9) to assure that the $\epsilon_i$ were the only errors in the model. The $\epsilon_i$ were obtained by random sampling from $N(0, S^2)$ with

$$S = \mathrm{diag}(s_1, s_2, \ldots, s_m) , \quad s_i = (10^{-6})\, y_i^* , \qquad (12)$$

which means that the errors in the $y_i$ were in the 6th digit. The discretized model can thus be written

$$y^* = Kx^* , \quad y = Kx^* + \epsilon , \quad \epsilon \sim N(0, S^2) , \qquad (13)$$

and the least squares estimate

$$\hat{x} = (K^T S^{-2} K)^{-1} K^T S^{-2} y , \qquad (14)$$

computed by LINPACK subroutines DQRDC and DQRSL [2], is shown in Figure 2. The dashed curve is $x(t)$ and the jagged curve is the estimate. The large oscillations are induced by errors in the 6th digit of the $y_i$! Such ill-conditioning is typical of regression models arising from discretized first kind integral equations.

Figure 2:
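
The discretization described above can be sketched in plain Python. This is a minimal illustration, not the author's code: the function names are hypothetical, the mesh sizes m = 150 and n = 121 follow the text, and the closed form used for y(t) is the standard Phillips expression, which is assumed here to be what equation (9) states. The sketch builds K from trapezoidal weights and checks that Kx* stays close to the analytic y(t).

```python
import math


def kernel(t, xi):
    """Phillips kernel K(t, xi) as in equation (8)."""
    d = xi - t
    if abs(d) <= 3.0 and abs(t) <= 6.0:
        return 1.0 + math.cos(math.pi * d / 3.0)
    return 0.0


def x_true(xi):
    """Solution x(xi) as in equation (10)."""
    return 1.0 + math.cos(math.pi * xi / 3.0) if abs(xi) <= 3.0 else 0.0


def y_exact(t):
    """Closed form of y(t); assumed to match equation (9)."""
    if abs(t) > 6.0:
        return 0.0
    a = abs(t)
    return ((6.0 - a) * (1.0 + 0.5 * math.cos(math.pi * t / 3.0))
            + 9.0 / (2.0 * math.pi) * math.sin(math.pi * a / 3.0))


def build_system(m=150, n=121):
    """Return the meshes t_i, xi_j and the matrix K with entries w_j * K(t_i, xi_j)."""
    ts = [-5.925 + i * 11.85 / (m - 1) for i in range(m)]
    h = 6.0 / (n - 1)
    xis = [-3.0 + j * h for j in range(n)]
    w = [h] * n
    w[0] = w[-1] = h / 2.0            # trapezoidal end-point weights
    K = [[w[j] * kernel(t, xis[j]) for j in range(n)] for t in ts]
    return ts, xis, K


if __name__ == "__main__":
    ts, xis, K = build_system()
    x_star = [x_true(xi) for xi in xis]
    y_star = [sum(kij * xj for kij, xj in zip(row, x_star)) for row in K]
    print(max(abs(ys - y_exact(t)) for ys, t in zip(y_star, ts)))
```

The printed value is the largest quadrature error over the t mesh, which is the quantity the text asks to be small relative to the measuring errors.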

Classical  Perturbation  Theory 

To simplify the discussion in this section, let

$$b = S^{-1}y , \quad A = S^{-1}K , \qquad (15)$$

and rewrite (13) as

$$b^* = Ax^* , \quad b = Ax^* + \delta b , \quad \delta b \sim N(0, I_m) . \qquad (16)$$

The problem of interest is to find bounds for the errors in the least squares solution $\hat{x} = (A^T A)^{-1} A^T b$.

The traditional approach ignores $x^*$ and the statistical assumptions about $\delta b$, seeking instead to bound the difference between estimates corresponding to two different b vectors. One of these, b, corresponds to the problem

$$\|Ax - b\|_2^2 = \min = \rho_{LS} , \qquad (17)$$

and the other, $b + \Delta b$, corresponds to a perturbed problem

$$\|(A + \Delta A)x - (b + \Delta b)\|_2^2 = \min , \qquad (18)$$

where $\Delta b$ and $\Delta A$ represent the uncertainties in b and A. The regression model assumes that A is known exactly, or at least to much higher precision than b, but numerical analysts argue that truncation errors arising when A is read into a finite-accuracy computer should be taken into account. A long and intricate argument [3] leads to the following error bound:

l|x  -  *I|2  ^  -  /  2K(A)||b|l3  -I-  /)ls[k(A)]=*  \  , 

(19) 
where

$$\epsilon = \max\left\{ \frac{\|\Delta A\|_2}{\|A\|_2} , \frac{\|\Delta b\|_2}{\|b\|_2} \right\} , \qquad (20)$$

and

$$\kappa(A) = \mathrm{cond}(A) = \frac{\sigma_{max}(A)}{\sigma_{min}(A)} = \frac{\sigma_1}{\sigma_n} \qquad (21)$$

is the condition number, which is just the ratio of the largest to the smallest singular value of A.

While numerical analysts are fascinated by the truncation $\Delta A$, people who actually make measurements usually insist on a computer arithmetic with enough precision to render such perturbations negligible in comparison to the measurement errors. When the Computer Acquisition Committee at the National Bureau of Standards was writing specifications for a new computer in 1984, some members insisted on a machine with 64-bit single precision because 32-bit machines give only 6 to 7 digits of precision, and they routinely measured things
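better than that. Accordingly, let $\Delta A = 0$. This leads to the more easily obtained [6] bound

$$\frac{\|\hat{x} - \tilde{x}\|_2}{\|\hat{x}\|_2} \le \mathrm{cond}(A)\, \frac{\|\Delta b\|_2}{\|b\|_2} , \qquad (22)$$

where $\hat{x}$ solves (17) and $\tilde{x}$ solves (18). This bound also depends strongly on cond(A).

Assessing the Classical Bound

The bound (22) is computable, but it does not relate a computed estimate to $x^*$. To obtain such a result, let

$$b = b^* = Ax^* , \quad \Delta b = \delta b \sim N(0, I_m) , \qquad (23)$$

and replace problems (17) and (18) with

$$\|Ax^* - b^*\|_2^2 = \min = 0 , \quad \|Ax - (b^* + \delta b)\|_2^2 = \min . \qquad (24)$$

The bound (22) then becomes

$$\frac{\|\hat{x} - x^*\|_2}{\|x^*\|_2} \le \mathrm{cond}(A)\, \frac{\|\delta b\|_2}{\|Ax^*\|_2} , \qquad (25)$$

which is not practicable because it depends on $x^*$. But $x^*$ is known for the test problem, and this provides a means for evaluating the perturbation bound. To restore the original notation, substitute (15) into (25) to obtain

$$\frac{\|\hat{x} - x^*\|_2}{\|x^*\|_2} \le \mathrm{cond}(S^{-1}K)\, \frac{\|S^{-1}\epsilon\|_2}{\|S^{-1}Kx^*\|_2} , \qquad (26)$$

where

$$\mathrm{cond}(S^{-1}K) = \frac{\sigma_{max}(S^{-1}K)}{\sigma_{min}(S^{-1}K)} = \frac{\sigma_1}{\sigma_n} . \qquad (27)$$

Multiplying (26) by $\|x^*\|_2$ and squaring both sides gives

$$\|\hat{x} - x^*\|_2^2 \le \frac{[\mathrm{cond}(S^{-1}K)]^2\, \|x^*\|_2^2}{\|S^{-1}Kx^*\|_2^2}\, \|S^{-1}\epsilon\|_2^2 . \qquad (28)$$

Since both sides are non-negative functions of the random vector $\epsilon$, it follows that

$$E\left(\|\hat{x} - x^*\|_2^2\right) \le \frac{[\mathrm{cond}(S^{-1}K)]^2\, \|x^*\|_2^2}{\|S^{-1}Kx^*\|_2^2}\, E\left(\|S^{-1}\epsilon\|_2^2\right) . \qquad (29)$$

It follows from (13) that $S^{-1}\epsilon \sim N(0, I_m)$, which implies $\|S^{-1}\epsilon\|_2^2 \sim \chi^2(m)$, so $E(\|S^{-1}\epsilon\|_2^2) = m$. Therefore

$$E\left(\|\hat{x} - x^*\|_2^2\right) \le \frac{m\, [\mathrm{cond}(S^{-1}K)]^2\, \|x^*\|_2^2}{\|S^{-1}Kx^*\|_2^2} , \qquad (30)$$

which relates $\hat{x}$ to $x^*$, but with the elements of $|\hat{x} - x^*|$ muddled together. To clarify, define $|\Delta x|_{rms}$ by

$$|\Delta x|_{rms} = \sqrt{\frac{E\left(\|\hat{x} - x^*\|_2^2\right)}{n}} , \qquad (31)$$

so by (30),

$$|\Delta x|_{rms} \le \sqrt{\frac{m}{n}}\ \mathrm{cond}(S^{-1}K)\, \frac{\|x^*\|_2}{\|S^{-1}Kx^*\|_2} . \qquad (32)$$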

The quantity $|\Delta x|_{rms}$ is the expected root mean squared absolute error for the components of $\hat{x}$. The test
problem  has  ||x*||2  =  13.82,  cri(S”^K)  =  3.3950  x  10®, 
and  ai2i(S’“iK)  =  1.1610.  Thus  cond(S-iK)  =  2.924  x 
10®,  and  by  (12), 

$$S^{-1}Kx^* = S^{-1}y^* = (10^6, 10^6, \ldots, 10^6)^T , \qquad (33)$$

so $\|S^{-1}Kx^*\|_2 = 1.225 \times 10^7$. Substituting these values
into  (32)  gives  lA®|rma  <  3.67  x  10®,  a  wildly  pessimistic 
bound. Figure 3 gives a componentwise plot of the actual errors $\hat{x} - x^*$ with the true values of $\pm|\Delta x|_{rms} = \pm 0.302$ plotted as dashed lines.

The classical bound is hopelessly pessimistic because it does not take the random nature of the errors into account. Starting with a measured b and corresponding solution $\hat{x}$, it considers all measured vectors $b + \delta b$ with $\|\delta b\|_2 \le \|\Delta b\|_2$. These vectors define corresponding solutions $\tilde{x} = \hat{x} + \delta x$, and to make the bound hold with certainty for all $b + \delta b$, it assumes the worst possible combination of the 121 perturbations $\delta b$. When the errors are drawn randomly, the probability of such a combination is negligibly small.
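Figure 3: Errors in Estimated Solution Vector

Statistical Perturbation Bounds

A more reasonable bound can be obtained by considering the statistical properties of the errors. By (13),

$$(\hat{x} - x^*) \sim N\left(0, (K^T S^{-2} K)^{-1}\right) , \qquad (34)$$

whence

$$(\hat{x} - x^*)^T K^T S^{-2} K (\hat{x} - x^*) \sim \chi^2(n) , \qquad (35)$$

so

$$E\left\{(\hat{x} - x^*)^T K^T S^{-2} K (\hat{x} - x^*)\right\} = n . \qquad (36)$$

Now consider the singular value decomposition

$$S^{-1}K = U \begin{pmatrix} \Sigma \\ 0 \end{pmatrix} V^T , \quad \Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_n) , \quad U^T U = I_m , \quad V^T V = I_n . \qquad (37)$$

Substituting into (36) and simplifying gives

$$E\left\{\|\Sigma V^T (\hat{x} - x^*)\|_2^2\right\} = n , \qquad (38)$$

and, since $\sigma_n$ is the minimum singular value,

$$\sigma_n^2\, E\left\{\|V^T (\hat{x} - x^*)\|_2^2\right\} \le n . \qquad (39)$$

The two-norm is invariant with orthogonal rotations, so

$$\sigma_n^2\, E\left\{\|\hat{x} - x^*\|_2^2\right\} \le n . \qquad (40)$$

Dividing through by $n\sigma_n^2$ gives

$$\frac{E\left\{\|\hat{x} - x^*\|_2^2\right\}}{n} \le \frac{1}{\sigma_n^2} , \qquad (41)$$

whence, by (31),

$$|\Delta x|_{rms} \le \frac{1}{\sigma_n} . \qquad (42)$$

This bound is computable without knowing $x^*$, and it does not depend on $\mathrm{cond}(S^{-1}K)$. For the test problem, $|\Delta x|_{rms} \le 0.861$, which exceeds the true value by a factor of only 2.85.

Confidence Intervals

Both the classical and statistical perturbation analyses are rendered moot by confidence interval calculations. If $\hat{x}$ is the least squares solution for the model (13), then

$$\hat{x} \sim N\left[x^*, (K^T S^{-2} K)^{-1}\right] , \qquad (43)$$

so the variances of the individual $\hat{x}_j$ are given by

$$V(\hat{x}_j) = e_j^T (K^T S^{-2} K)^{-1} e_j , \quad j = 1, 2, \ldots, n , \qquad (44)$$

where $e_j$ is the unit vector with 1 as the jth element. For any probability $\alpha$ ($0 < \alpha < 1$), if $\kappa$ is chosen to satisfy

$$\frac{1}{\sqrt{2\pi}} \int_{-\kappa}^{\kappa} e^{-t^2/2}\, dt = \alpha , \qquad (45)$$

then

$$\Pr\left\{ \hat{x}_j - \kappa\sqrt{V(\hat{x}_j)} \le x_j^* \le \hat{x}_j + \kappa\sqrt{V(\hat{x}_j)} \right\} = \alpha . \qquad (46)$$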

The $\kappa$-value for $\alpha$ = .95 is $\kappa$ = 1.96. Figure 4 shows the 95% confidence bounds for the test problem. The dashed line is the true solution and the jagged lines connect the upper and lower bounds for the individual $\hat{x}_j$.

If $S^2 = s^2 I_m$, with s unknown, then the estimate $\hat{s}^2 = (m - n)^{-1}\rho_{LS}$ can be used to construct confidence intervals, though the relation between $\kappa$ and $\alpha$ will be different from (45). If the $\epsilon$-distribution is unknown, confidence intervals can be constructed from the Chebyshev inequality. Though wider than those for normally distributed errors, these intervals are often orders of magnitude smaller than the $\pm|\Delta x|_{rms}$ bounds from classical perturbation theory.
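
The interval calculation in (44) and (46) can be illustrated on a tiny problem. The sketch below is only an illustration under stated assumptions: it uses a hypothetical two-parameter straight-line model whose matrix K has rows (1, t_i), so that the 2 x 2 matrix $K^T S^{-2} K$ can be inverted in closed form, and the function names are invented for this example.

```python
import math


def weighted_line_fit(ts, ys, ss):
    """Least squares estimate and variance matrix for y_i = x_0 + x_1 t_i + e_i.

    K has rows (1, t_i) and S = diag(s_i); returns
    xhat = (K^T S^-2 K)^-1 K^T S^-2 y and cov = (K^T S^-2 K)^-1.
    """
    m00 = m01 = m11 = v0 = v1 = 0.0
    for t, y, s in zip(ts, ys, ss):
        w = 1.0 / (s * s)
        m00 += w
        m01 += w * t
        m11 += w * t * t
        v0 += w * y
        v1 += w * t * y
    det = m00 * m11 - m01 * m01
    cov = [[m11 / det, -m01 / det], [-m01 / det, m00 / det]]
    xhat = [cov[0][0] * v0 + cov[0][1] * v1, cov[1][0] * v0 + cov[1][1] * v1]
    return xhat, cov


def confidence_intervals(xhat, cov, kappa=1.96):
    """Intervals xhat_j -/+ kappa * sqrt(V(xhat_j)); kappa = 1.96 gives 95%."""
    out = []
    for j, x in enumerate(xhat):
        half = kappa * math.sqrt(cov[j][j])
        out.append((x - half, x + half))
    return out


if __name__ == "__main__":
    ts = [0.0, 1.0, 2.0, 3.0, 4.0]
    ys = [2.0 + 3.0 * t for t in ts]      # noise-free data, for illustration only
    ss = [0.1] * len(ts)
    xhat, cov = weighted_line_fit(ts, ys, ss)
    print(xhat, confidence_intervals(xhat, cov))
```

Only the diagonal of the variance matrix is needed for the intervals, which is why they are cheap to compute once the least squares problem has been solved.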

The keynote speaker [7] pointed out that the variance matrix for $\hat{x}$ was known to Gauss, and that modern least squares algorithms could easily compute it by inverting an upper triangular matrix formed in solving for $\hat{x}$. Unfortunately, the least squares subroutines in the widely used LINPACK [2] and LAPACK [1] collections do not return confidence intervals, or even the variance matrix. The LINPACK manual describes how to compute variances from a reduced matrix returned by subroutine SQRDC, but the LAPACK manual is silent on the subject, and neither mentions confidence intervals, concentrating instead on the classical perturbation bounds. Secondary sources, which use these collections, have continued this preoccupation with what are essentially useless bounds. They also continue to propagate misinformation about the condition number. For example, the textbook of Kahaner et al. [4] states that:

Figure 4:

One useful interpretation of the condition number is that its logarithm approximates the number of digits which will be lost while solving
Ax  =  h.  Thus  if  cond(A)  =  10®  and  if  machine 
epsilon  is  10”®,  then  the  best  we  can  expect 
is  that  the  solution  will  be  accurate  to  about 
three  digits. 

The  estimate  in  Figure  2  was  calculated  in  double 
precision  with  ^niach  ~  ^  10”^®,  and  since 

cond(S“^K)  =  2.92  x  10®,  the  above  reasoning  would  in¬ 
dicate  that  the  computed  x  is  accurate  to  6  digits.  But 
consider  the  same  calculation  in  single  precision  with 
®mach  =  ^  cond(S”iK)  =  2.93  x  10®. 

According to the conventional wisdom, a computed estimate should not contain any digits of accuracy. The actual single precision estimate is shown in Figure 5. The slight differences from the double precision estimate are difficult to see by comparing the two plots. The rms average difference between the two estimates is 0.0033, which is almost 100 times smaller than the $|\Delta x|_{rms}$ of either estimate, so in practice, either estimate would serve equally well. Clearly the condition number is not always a good indicator of the accuracy of the estimate.

Figure 5:


Abstract

Compartment models are widely used in pharmacokinetics. Our objective is to fit compartment models to data. General numerical optimization methods frequently perform poorly for this purpose. A good book on numerical optimization, such as Dennis and Schnabel's [3], describes multiple techniques and discusses the advantages and disadvantages of each. In order to implement these methods in a software product, one must make a number of decisions. For example, should a line search method or a trust region method be used? How should variables on widely different scales be handled? In general, these are questions without clear-cut answers.

Compartment models are defined by linear differential equations. Consequently, compartment models have a particular structure. We have tailored general optimization methods to exploit this structure. Through study and experimentation we have found workable answers to the questions posed above.


1  Introduction 

Compartment models are illustrated with a study of gold kinetics. I will give some background for the study, the kinetic diagram, the differential equations, and the data. The scaling problem is described, and a method for dealing with it is presented. A variety of algorithms are described, and our preference is stated.


2  An  example  model 

This example from Gerber et al. [4] deals with gold kinetics. The effects of aurothiomalate therapy last far beyond the time where there are measurable serum levels. However, whole-body radiation counts can be made over any interval of time. In this study, serum levels and whole-body counts are simultaneously fit to a two-compartment model. We assume the blood serum is a compartment, and the remainder of the body is a compartment. The compartments and flows are depicted in Figure 1. The parameters k21, k12, and k01 are called rate constants.

Figure 1: Model for gold kinetics study

Aurothiomalate is injected into the blood serum. At several values of elapsed time, two types of observations are made: the concentration in the serum and a radioactive count on the whole body. Dose 1 is the initial value in the serum compartment in units of concentration. Dose 2 is the initial value of the sum of the two compartments in units of counts. For theoretical discussions, the parameters are in an indexed vector $\theta$. We use the following correspondence: $k_{21} = \theta_1$, $k_{12} = \theta_2$, $k_{01} = \theta_3$, Dose 1 = $\theta_4$, and Dose 2 = $\theta_5$.

The differential equations associated with the kinetic diagram are

$$\frac{d}{dt} \begin{pmatrix} P_{1j}(t) \\ P_{2j}(t) \end{pmatrix} = \begin{pmatrix} -(\theta_1 + \theta_3) & \theta_2 \\ \theta_1 & -\theta_2 \end{pmatrix} \begin{pmatrix} P_{1j}(t) \\ P_{2j}(t) \end{pmatrix} .$$

The subscript j is used to distinguish between solutions resulting from different initial conditions. Specifically, $(P_{1j}(t), P_{2j}(t))^T$ is the solution when $(P_{1j}(0), P_{2j}(0))^T$ is the jth elementary vector. $P_{ij}(t)$ is the proportion of material that goes from compartment j to compartment i in the time interval (0, t). For the deterministic form of the model we need only the solution for j = 1 because the dose is administered in the first compartment. The solution is

$$\lambda_1, \lambda_2 = \frac{-(\theta_1 + \theta_2 + \theta_3) \pm \sqrt{(\theta_1 + \theta_2 + \theta_3)^2 - 4\theta_2\theta_3}}{2}$$

$$P_{11}(t) = \frac{-\lambda_1(\theta_3 + \lambda_2)\exp(\lambda_1 t) + \lambda_2(\theta_3 + \lambda_1)\exp(\lambda_2 t)}{\theta_3(\lambda_2 - \lambda_1)}$$

$$P_{\cdot 1}(t) = \frac{(\theta_3 + \lambda_2)\exp(\lambda_1 t) - (\theta_3 + \lambda_1)\exp(\lambda_2 t)}{\lambda_2 - \lambda_1}$$

where $P_{\cdot 1}(t) = P_{11}(t) + P_{21}(t)$.

The analytical solutions given above are to make the example precise and self-contained. In practice, a computer can solve the required differential equations and also find the derivatives of the solutions with respect to the parameters.
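
The closed-form solution above is easy to evaluate directly. The following Python sketch is illustrative only: the function names and the rate-constant values used in the usage lines are hypothetical, not estimates from the gold kinetics study. It also checks two properties that follow from the model, namely that both proportions equal 1 at t = 0 and that the total is lost at rate $\theta_3 P_{11}(t)$.

```python
import math


def eigenvalues(theta1, theta2, theta3):
    """The two eigenvalues lambda_1, lambda_2 of the rate matrix."""
    s = theta1 + theta2 + theta3
    root = math.sqrt(s * s - 4.0 * theta2 * theta3)
    return (-s + root) / 2.0, (-s - root) / 2.0


def p11(t, theta1, theta2, theta3):
    """Proportion of the dose present in compartment 1 (serum) at time t."""
    lam1, lam2 = eigenvalues(theta1, theta2, theta3)
    num = (-lam1 * (theta3 + lam2) * math.exp(lam1 * t)
           + lam2 * (theta3 + lam1) * math.exp(lam2 * t))
    return num / (theta3 * (lam2 - lam1))


def p_total(t, theta1, theta2, theta3):
    """P_.1(t) = P_11(t) + P_21(t): proportion still in the body at time t."""
    lam1, lam2 = eigenvalues(theta1, theta2, theta3)
    num = ((theta3 + lam2) * math.exp(lam1 * t)
           - (theta3 + lam1) * math.exp(lam2 * t))
    return num / (lam2 - lam1)


if __name__ == "__main__":
    rates = (0.2, 0.05, 0.1)              # hypothetical theta_1, theta_2, theta_3
    for t in (0.0, 1.0, 5.0, 20.0):
        print(t, p11(t, *rates), p_total(t, *rates))
```

Multiplying `p11` by $\theta_4$ or `p_total` by $\theta_5$ then gives the expected serum or whole-body observation at a given time.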

The data were read from figures in Gerber et al. by Uno et al. [5], and are given in Table 1. The expected values of the observations $y_i$ are

$$\mu_i(\theta) = \begin{cases} \theta_4 \times P_{11}(t_i) & \text{for } i \le 7 \\ \theta_5 \times \left(P_{11}(t_i) + P_{21}(t_i)\right) & \text{for } i \ge 8 . \end{cases}$$

The method of estimation is to find the value of the parameter vector $\theta$ so that


is minimized. The parameter $\lambda$ is specified by the data analyst to stabilize the variances of the $y_i$. For example, $\lambda$ = 0.5 gives the square root transformation, and $\lambda$ = 0.0 gives the logarithmic transformation. See Box and Cox [2] for theory and strategies of choosing $\lambda$.


  i    Site     t_i       y_i
  1    serum    1.06      354.0
  2    serum    2.13      284.0
  3    serum    3.19      238.0
  4    serum    4.26      200.0
  5    serum    5.11      175.0
  6    serum    6.17      145.0
  7    serum    7.23      128.0
  8    body     0.00      100.0
  9    body     3.11      87.23
 10    body     5.19      79.79
 11    body     14.53     60.64
 12    body     21.79     54.26
 13    body     41.51     46.81
 14    body     62.26     44.68
 15    body     97.55     41.49
 16    body     174.34    36.17
 17    body     215.85    34.04

Table 1: The data


3  Dealing with the scaling problem


The  two  classes  of  parameters,  rate  constants,  and  initial 
values,  have  vastly  different  scales.  Ordinary  nonlinear 
regression  algorithms  can  be  very  slow  to  converge. 

The  problems  can  be  demonstrated  using  a  model 
simpler  than  the  one  presented  in  the  preceding  section 

k 

Vi  =  D .  (exp(-M.)  -  exp(-*e<<))  + 

Figure 2 contains two response curves for this model. There are two rate constants, $k_a$ and $k_e$, and they are the same for each curve. The initial value, D, for the lower curve is one tenth the initial value in the upper curve. Artificial data are taken from the upper curve with no error term. Parameters of the lower curve are used as starting values for a common algorithm. Convergence is very slow even though two of the three parameter estimates are correct.

We  return  our  attention  to  the  gold  kinetics  study. 
When  A  =  1.0,  S4  and  ^5  are  conditionally  linear  param¬ 
eters.  Bates  and  Watts[l]  discuss  handling  conditionally 
linear  parameters.  The  method  is  to  alternate  between 
normal  iterations  and  off  iterations  where  the  rate  con¬ 
stants  are  fixed. 

I propose doing exactly the same thing even when
$\lambda \neq 1.0$. While these off iterations are not linear
problems, the process allows the initial value parameters to
partially adjust to the current values of the rate
constants. Experience with several examples has shown that
speed of convergence can be dramatically improved using
this technique.

Figure 2: A pathological example
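The off iteration can be illustrated with a minimal sketch. It assumes a hypothetical two-rate response shape (a stand-in, not the paper's exact model) in which the initial value $D$ enters linearly, so that with the rate constants held fixed the update for $D$ is a one-parameter linear least-squares fit. The names `shape` and `off_iteration` are illustrative only.

```python
import numpy as np

def shape(t, ka, ke):
    """Hypothetical two-rate response shape (an assumption for illustration)."""
    return np.exp(-ke * t) - np.exp(-ka * t)

def off_iteration(t, y, ka, ke):
    """Update the initial value D with the rate constants held fixed.

    Because D multiplies the shape linearly, the least-squares update
    has the closed form (g'y) / (g'g).
    """
    g = shape(t, ka, ke)
    return float(g @ y / (g @ g))

# Artificial, error-free data from the "upper curve".
t = np.linspace(0.25, 12.0, 24)
ka, ke, D_true = 1.5, 0.2, 10.0
y = D_true * shape(t, ka, ke)

# Start from the "lower curve": correct rate constants, D one tenth of the truth.
D_start = D_true / 10.0
D_new = off_iteration(t, y, ka, ke)
print(D_start, D_new)
```

In this setting a single off iteration moves $D$ from the lower curve to the upper curve, which is the adjustment an ordinary joint iteration struggles to make when the parameter scales differ widely.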

4  Choice  of  method 

Newton's method for minimizing the objective function (1)
is probably best, but because it requires second
derivatives, it is infrequently used. A modified Gauss-Newton
method is nearly always used. The question
is which modification to use.

We  use  the  following  notation  to  describe  possible 
estimators: 

m  =  ^Q{6) 


All  nonlinear  regression  algorithms  are  iterative.  At  each 
iteration  the  current  values  of  the  estimates  are  updated 
by  adding  an  adjustment  vector.  Possible  formulas  for 
the  adjustment  vectors  are: 

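\[ (J'(\theta)J(\theta))^{-1} J'(\theta)\, r(\theta) \tag{2} \]

\[ (J'(\theta)J(\theta) + \gamma I)^{-1} J'(\theta)\, r(\theta) \tag{3} \]

\[ \alpha\,(J'(\theta)J(\theta))^{-1} J'(\theta)\, r(\theta) \tag{4} \]

\[ \alpha\,(J'(\theta)J(\theta) + \gamma I)^{-1} J'(\theta)\, r(\theta) \tag{5} \]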


Expression (2) is the Gauss-Newton formula which will
often not converge. Expression (3) is the Levenberg-Marquardt
formula. This method is popular, and many
different strategies for choosing $\gamma$ have been proposed. A
combination of expression (2) and expression (3) is called
the trust region method and is detailed in Dennis and
Schnabel[3]. Expression (4) is the line search formula
and is used with a strategy for choosing $\alpha$.

I prefer a line search algorithm, backtracking with
cubic interpolation when required. This is also detailed
in Dennis and Schnabel. For the search direction, I use
expression (5) with $\gamma$ fixed at some small number. The
search parameter $\alpha$ is adjusted at each iteration using cubic
interpolation. Fixing $\gamma$ to be positive avoids having to
check if $J(\theta)$ is singular. The nature of compartment
models is such that $Q(\theta)$ and its directional derivatives
along the search line are easy to compute. The cubic
interpolation provides an intelligent update.
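A minimal sketch of an iteration built on expression (5) follows. It makes several assumptions that are not from the text: the residual is taken as $r(\theta) = y - f(\theta)$ with $J$ the Jacobian of $f$, the step length is chosen by simple step halving with a sufficient-decrease test rather than the cubic interpolation described above, and the one-term exponential model is a hypothetical test problem.

```python
import numpy as np

def fit_damped_gn(model, jac, t, y, theta0, gamma=1e-6, max_iter=100, tol=1e-12):
    """Damped Gauss-Newton direction (expression (5)) with step-halving backtracking."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        r = y - model(t, theta)              # residual vector r(theta)
        J = jac(t, theta)                    # n x p Jacobian of the model
        q = r @ r                            # objective Q(theta)
        g = J.T @ r
        # gamma > 0 keeps the system positive definite even if J is rank deficient
        step = np.linalg.solve(J.T @ J + gamma * np.eye(theta.size), g)
        slope = -2.0 * (g @ step)            # directional derivative of Q along step
        alpha = 1.0
        while True:
            trial = theta + alpha * step
            r_new = y - model(t, trial)
            if r_new @ r_new <= q + 1e-4 * alpha * slope or alpha < 1e-10:
                break
            alpha *= 0.5                     # halve instead of cubic interpolation
        theta = trial
        if np.linalg.norm(alpha * step) < tol:
            break
    return theta

def decay(t, th):
    """Hypothetical test model: th[0] * exp(-th[1] * t)."""
    return th[0] * np.exp(-th[1] * t)

def decay_jac(t, th):
    e = np.exp(-th[1] * t)
    return np.column_stack([e, -th[0] * t * e])

t = np.linspace(0.0, 5.0, 20)
y = decay(t, np.array([2.0, 0.5]))           # error-free artificial data
theta_hat = fit_damped_gn(decay, decay_jac, t, y, theta0=[1.5, 0.8])
print(theta_hat)
```

Replacing the halving rule with a cubic fit through the objective values and directional derivative already computed in the loop would bring the sketch closer to the backtracking scheme of Dennis and Schnabel.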
ABSTRACT

Extending results of Dawid (1973), O'Hagan (1979),
Meeden & Isaacson (1977), and Angers & Berger (1991),
we develop a general theory of model behavior for different
distributional assumptions in a hierarchical model
in the presence of outlying data. The score function of
a density allows characterization of densities into four
groups based on their tail behavior. Using convolution
theory, we characterize the behavior of a location
parameter estimator in a hierarchical model depending on
the group membership of the densities involved. These
results extend to multivariate distributions under the
assumption of exchangeability. Using mixture
distributions (Andrews and Mallows, 1974) we implement a
Gibbs Sampler for prototypes from these groups. This
theory indicates the model behavior for most commonly
used distributions, including a variation of the
multivariate Laplace.

1  INTRODUCTION 

We are interested in the sensitivity of hierarchical models
to the distributions specified for each level. If we assume
a two level model,

\[ Y \mid \theta, \sigma^2 \sim f_{Y \mid \theta, \sigma^2}(y \mid \theta, \sigma^2) \]

we can estimate the unknown parameters $\theta$ using Empirical
or Hierarchical Bayes methodology. We would
like some method for knowing how the estimate of $\theta$ will
behave for different structural assumptions on $f_Y$ and
$f_\Theta$. It is well known that conjugate densities can lead
to undesirable behavior (e.g. Lindley & Smith, 1972) as
the data and the prior information become discrepant,
i.e. $|y - \mu| \to \infty$, by always compromising between the
likelihood and prior. This has led to research on the
behavior of the posterior mean for nonconjugate densities
as $|y - \mu| \to \infty$. Assuming $\theta$ is a location parameter,
Dawid (1973) and O'Hagan (1979) derived conditions
such that the posterior tends to the prior, thus rejecting
the information from the likelihood. Reversing the
conditions, the posterior behaves as the likelihood. Hill
(1974) extended these results to the multivariate setting.
Sanso and Pericchi (1992) examined behavior for a normal
likelihood and Laplace prior, finding that the posterior
mean tends to $y - c$ where $c$ is some constant,
and thus the prior exerts bounded influence. Angers
and Berger (1991) examined the multivariate behavior
for a Cauchy prior. Meeden and Isaacson (1977) developed
similar theory for $f_Y$ an exponential family and $\theta$
the canonical parameter which was extended by Pericchi,
Sanso, and Smith (1993) to expectation parameters.
Along similar lines, Lucas (1993) developed conditions
for posterior normality when both densities belong to
the Box-Tiao family.

* Research supported by a National Defense Science and Engineering
Graduate Fellowship

We refer to these results as "what-if" asymptotics.
They indicate how the model behaves as $|y - \mu| \to \infty$,
such as ignoring either the prior or likelihood, always
compromising between them, or exhibiting bounded
influence. Our aim is to develop a general theory that
will describe the model behavior for arbitrary parametric
forms of the likelihood and prior when $\theta$ is a location
parameter. To accomplish this, we use a scheme
for classifying densities into disjoint classes based on tail
behavior, and demonstrate that the relative ordering of
the tails determines posterior behavior. Thus, the problem
of selecting densities for each level of a hierarchical
model simplifies to determining class membership. We
extend the results to the multivariate setting and additional
levels in the hierarchy. While research in this area
generally assumes the variance components are known,
our results also indicate the behavior of the posterior
distributions of $\sigma^2$ and $\tau$ for general priors on these scale
parameters. We also demonstrate a Gibbs Sampling
implementation that immediately provides estimates of the
desired posterior distributions for prototypes from each
of these classes, and thus provides information for all
densities in that class.

2  DENSITY  CLASSIFICATION 

To classify a density's tail behavior, we utilize the negative
log rate, $\mathrm{NLR}_f(x) = -\frac{d}{dx}\log f(x)$. This is equal
to minus the score function and will be applied to
likelihoods and priors. Using a classification scheme adapted
from Gomez-Villegas and Main (1992), we classify a density as

• Very Light if $\mathrm{NLR}_f(x) \to \infty$

• Light if $\mathrm{NLR}_f(x) \to c$, $0 < c < \infty$

• Medium-Heavy if $\mathrm{NLR}_f(x) \to 0$

  - Medium if $x\,\mathrm{NLR}_f(x) \to \infty$
  - Heavy if $x\,\mathrm{NLR}_f(x) \to c$, $c < \infty$,

where all limits are as $x \to \infty$. This scheme agrees
with our intuition by classifying a Normal density as
Very Light, a Laplace density as Light, and the t as
Heavy. We see that the tail of $f_1$ is heavier than the tail of
$f_2$ if $\lim \mathrm{NLR}_{f_1}(x) < \lim \mathrm{NLR}_{f_2}(x)$. As shown below,
this tail characteristic determines if our estimates will
compromise or ignore the information from the densities
involved.
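The classification can be checked numerically with a short sketch. The closed-form negative log rates below follow from differentiating the standard Normal, Laplace, and Student t log densities; the fourth density, proportional to $\exp(-\sqrt{x})$, is a hypothetical example added here to exercise the Medium class. The function `classify` is a crude numerical heuristic (it evaluates the rate at one large $x$ against arbitrary thresholds), not part of the theory.

```python
import numpy as np

def nlr_normal(x, sigma=1.0):
    # -d/dx log f for N(0, sigma^2)
    return x / sigma**2

def nlr_laplace(x, scale=1.0):
    # -d/dx log f for a Laplace density with the given scale
    return np.sign(x) / scale

def nlr_t(x, nu=3.0):
    # -d/dx log f for Student t with nu degrees of freedom
    return (nu + 1.0) * x / (nu + x**2)

def nlr_stretched(x):
    # hypothetical density proportional to exp(-sqrt(x)), x > 0
    return 0.5 / np.sqrt(x)

def classify(nlr, x=1e8, big=1e3, small=1e-3):
    """Heuristic: read the limits off a single large x (thresholds are arbitrary)."""
    v = nlr(x)
    if v > big:
        return "Very Light"
    if v > small:
        return "Light"
    return "Medium" if x * v > big else "Heavy"

labels = {
    "normal": classify(nlr_normal),
    "laplace": classify(nlr_laplace),
    "t": classify(nlr_t),
    "stretched": classify(nlr_stretched),
}
print(labels)
```

For the t density, $x\,\mathrm{NLR}_f(x) \to \nu + 1$, a finite constant, which is what places it in the Heavy class.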

3  CONVOLUTION  THEORY 

Using ideas from convolution theory, we obtain a theory
that determines posterior behavior based on the relative
NLR's of the likelihood and the prior. Since we
are assuming $\theta$ is a location parameter, $f_{Y|\theta}(y|\theta)$ can be
rewritten as $f_{Y-\theta}(y - \theta)$ for some pivot density $f^*$, and
the marginal density $\int f_{Y-\theta}(y - \theta)\pi(\theta)\,d\theta$ is the
convolution of $f^*$ and $\pi$. Berman (1992) developed theory for
the behavior of this convolution as $y \to \infty$. We have
extended this theory so that it may be applied to our
hierarchical models. Below is our main result; see Chance
(1994) for details.

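Theorem 1 Suppose $\mathrm{NLR}_\pi$ is a regularly varying function,
$\mathrm{NLR}_f > 0$, $g(t)$ has finite expectation, and

\[ \limsup_{y \to \infty} \mathrm{NLR}_\pi(y) < \liminf_{y \to \infty} \mathrm{NLR}_f(y) \]

then

\[ \int g(y-\theta) f_{Y-\theta}(y-\theta)\pi(\theta)\,d\theta \sim \pi(y) \int g(\theta) f_{Y-\theta}(\theta)\, e^{\theta\,\mathrm{NLR}_\pi(y)}\,d\theta \]

for $y \to \infty$, when $\int e^{t\,\mathrm{NLR}_\pi(y)} f_{Y-\theta}(t)\,dt < \infty$, and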

<  oo. 


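When $g(t) = 1$, this tells us that when we have an extreme
data value, the marginal behaves as the prior, evaluated
at the data point, times a correction factor. Note,
assuming $\mathrm{NLR}_\pi$ regularly varying is not a very restrictive
assumption since taking the logarithm suitably dampens
commonly used density functions. Applying Theorem 1
with $g(y - \theta) = 1$ and $g(y - \theta) = y - \theta$ we see that the
posterior expectation of $g(y - \theta)$ behaves as:

\[ E_{\theta|y}(g(y-\theta) \mid y) = \frac{\int g(y-t) f^*(y-t)\pi(t)\,dt}{\int f^*(y-t)\pi(t)\,dt} \sim \frac{\int g(t) f^*(t)\, e^{t\,\mathrm{NLR}_\pi(y)}\,dt}{\int f^*(t)\, e^{t\,\mathrm{NLR}_\pi(y)}\,dt} \]

We can further manipulate the equations to obtain an
expression for the posterior density:

\[ p_{\theta|y}(\theta \mid y) \sim \frac{f^*(y-\theta)\, e^{(y-\theta)\,\mathrm{NLR}_\pi(y)}}{\int_{-\infty}^{\infty} f^*(y-\theta)\, e^{(y-\theta)\,\mathrm{NLR}_\pi(y)}\,d\theta} = \frac{f^*(y-\theta)\, e^{-\theta\,\mathrm{NLR}_\pi(y)}}{\int_{-\infty}^{\infty} f^*(y-\theta)\, e^{-\theta\,\mathrm{NLR}_\pi(y)}\,d\theta}. \]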


Thus, Theorem 1 directly implies that when the prior
has the heavier tail, the marginal density behaves as the
prior density times a correction factor and the posterior
as the pivot density times a correction factor as $y \to \infty$.
When $\mathrm{NLR}_\pi(y) \to 0$, the marginal behaves as the prior,
a result closely related to Brown's (1988) heuristic, and
the posterior mean of $g(y - \theta)$ goes to $\int g(\theta) f^*(\theta)\,d\theta$.
When $g$ is an indicator function we see that the posterior
distribution tends to the invariant distribution of
the pivot $Y - \theta$, and $E(y - \theta \mid y) \to E_{f^*}(y - \theta) = 0$, that
is $E(\theta \mid y) \to y$. This is the result given by Dawid (1973)
and O'Hagan (1979). When the prior is a Light tailed
density, we can often evaluate the correction factor exactly.


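Example Let $f(y \mid \theta) \sim N(\theta, \sigma^2)$ and $\pi(\theta) \sim DE(0, \tau^2)$.
Since $\mathrm{NLR}_{f^*}(y) = y/\sigma^2$ and $\mathrm{NLR}_\pi(y) = \lambda$, a constant fixed by $\tau$, which is
regularly oscillating, the conditions of the theorem are
met. Applying the result,

\[ p(\theta \mid y) \sim \frac{f^*(y-\theta)\, e^{\lambda(y-\theta)}}{\int f^*(y-\theta)\, e^{\lambda(y-\theta)}\,d\theta} = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{ -\frac{\left(\theta - (y - \sigma^2\lambda)\right)^2}{2\sigma^2} \right\} = N(y - \sigma^2\lambda, \sigma^2). \]

This implies that $|y - E(\theta \mid y)| \to \sigma^2\lambda$, the result given
by Sanso and Pericchi (1992), see also Pericchi, Sanso,
& Smith (1993), and Lucas (1993).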
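The bounded-influence limit in the example can be checked numerically. The sketch below is an illustration under stated assumptions: the Laplace prior is parameterized directly by its rate $\lambda$ (density proportional to $e^{-\lambda|\theta|}$), the posterior mean is computed by a plain Riemann sum on a fixed grid, and the function name and grid limits are arbitrary choices.

```python
import numpy as np

def posterior_mean(y, sigma, lam, lo=-60.0, hi=60.0, n=480001):
    """Posterior mean of theta for y | theta ~ N(theta, sigma^2) and a
    Laplace prior with rate lam, by a Riemann sum on a uniform grid."""
    theta = np.linspace(lo, hi, n)
    log_w = -0.5 * ((y - theta) / sigma) ** 2 - lam * np.abs(theta)
    w = np.exp(log_w - log_w.max())      # unnormalized posterior, rescaled for stability
    return float(np.sum(w * theta) / np.sum(w))

sigma, lam = 1.0, 1.0
gaps = {y: y - posterior_mean(y, sigma, lam) for y in (2.0, 5.0, 10.0, 20.0)}
for y, gap in gaps.items():
    print(y, gap)
```

As $y$ grows, the printed gap $y - E(\theta \mid y)$ settles at $\sigma^2\lambda$, while for moderate $y$ it is smaller, consistent with the remark that the asymptotic result gives a reasonable approximation at intermediate data values.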


Note,  because  of  the  symmetry  in  the  problem,  we  can 
reverse  the  role  of  the  prior  and  likelihood.  For  example, 
if  the  likelihood  is  heavier,  we  see  that  the  marginal  tends 
to  the  pivot  density,  and  the  posterior  to  the  prior,  with 
the  appropriate  correction  factors. 


The main power of these results is that once we know
which classes the distributions belong to, we immediately
see how our estimate for $\theta$ will behave as $y \to \infty$. While
these are asymptotic results, we have found they provide
reasonable approximations for intermediate data values.
Moreover, we can extend the analysis to determine general
behavior for additional levels in the hierarchy, scale
parameters, and multidimensional parameters.

3.1  Higher  Levels 

Many  applications  of  hierarchical  models  contain  three 
or  more  levels.  For  example,  in  educational  research, 
we  may  have  repeat  observations  on  students  that  are 
grouped  by  classroom.  Thus,  we  can  see  how  both  the 
student  level  and  classroom  level  parameter  estimates 
will  behave  by  repeatedly  applying  the  above  theorem. 


In  general,  in  a  three  level  model,  the  behavior  of  the 
first  level  parameter  will  depend  on  the  first  level  distri¬ 
bution  and  the  convolution  of  the  higher  level  densities. 
Thus,  if  either  of  the  higher  level  densities  are  Heavy, 
while  the  first  density  is  not,  the  information  from  all 
higher  levels  will  be  asymptotically  ignored  in  the  poste¬ 
rior  density.  If  only  the  first  level  is  Heavy,  the  posterior 
tends  to  the  heavier  of  the  remaining  densities.  Similarly, 
when  the  second  level  parameter  is  a  location  parameter, 
we  can  describe  the  behavior  of  its  posterior  density.  In 
this  case,  if  only  the  first  level  is  Heavy,  the  posterior 
tends  to  the  prior,  with  mean  at  the  third  level  mean. 
Clearly,  the  value  specified  for  the  third  level  parameter 
becomes  important.  This  analysis  can  easily  be  extended 


to  additional  levels. 

3.2  Scale  Parameters 

Theorem 1 assumes $\theta$ is a location parameter; however,
in many applications it would be fruitful to know
the sensitivity of the variance component estimates. Often,
we can obtain similar theory for scale parameters by
reparameterizing them as location parameters. This theory
applies when the likelihood, $f(\sigma \mid y)$, is proportional to
$f(s(y)/\sigma)$. A likelihood of this form can be reexpressed as
$f^*(\log s(y) - \log\sigma)$. Transforming to $\theta = \log\sigma$,
$\theta$ is a location parameter and when Theorem 1 is applicable,
we have an approximation for the posterior of $\theta$
as $\log s(y) \to \infty$. While the NLRs of the transformed
densities are often not regularly oscillating, in practice
we have found the resulting approximations to still be
quite reasonable. We are currently attempting to generalize
the behavior when location and scale parameters
are assumed unknown simultaneously.

3.3  Multidimensional  Problems 

When $y$ is a $p$ variate random vector with mean
$\theta = (\theta_1, \ldots, \theta_p)'$, and we assume the $y_i$'s are conditionally
independent and the $\theta_i$'s are exchangeable, we can apply
Theorem 1 in each coordinate. Thus, if one coordinate,
say $y_j$, becomes large, the posterior mean for $\theta_j$ will
converge to some limiting value as
dictated by Theorem 1. When the variance components
are assumed known, the posterior means for the other
coordinates will behave as if the outlying coordinate did
not exist, displaying the expected shrinkage phenomena.
When the variance components are modeled as unknown,
the mean components will still be linked together, and
thus the other components will not completely reject
the outlier. This was shown to be true for the Normal-Cauchy
case by Angers and Berger (1991).

4  IMPLEMENTATION 

Since class membership of the tails determines model
behavior, if we can obtain estimates for a representative
from each class, we will know how all members of that
class behave as $y \to \infty$. Below we describe a Gibbs Sampling
implementation using scale mixtures of Normals
that allows estimation for Normal (Very Light), Double
Exponential (Light) and t densities (Medium-Heavy) by
multiplying the Normal scale parameter by $\lambda$ and then
specifying a density for $\lambda$. If we wanted to model a particular
density that is not easily implemented, we may alternatively
obtain estimates from these prototypes that
will be robust to outliers. The final estimates for $\lambda$ also
provide a diagnostic for outliers, see for example, Seltzer
(1993) and Racine-Poon (1992).

4.1  Student’s  t  Density 

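A multivariate t distribution can be obtained by mixing
a multivariate Normal distribution with a Gamma.
Let $Y \mid \lambda \sim N_n(\mu, \lambda\Sigma)$ and $\lambda \sim IG(\nu/2, \nu/2)$. Then
$Y \sim t_\nu(\mu, \Sigma)$ once we integrate out $\lambda$ because

\[ p(y) = \int_0^\infty N_n(y; \mu, \lambda\Sigma)\, IG(\lambda; \nu/2, \nu/2)\,d\lambda = \frac{\Gamma\left(\frac{\nu+n}{2}\right)}{|\Sigma|^{1/2} (\nu\pi)^{n/2}\, \Gamma\left(\frac{\nu}{2}\right)} \left[ 1 + \frac{(y-\mu)'\Sigma^{-1}(y-\mu)}{\nu} \right]^{-\frac{\nu+n}{2}}. \]

The conditional distribution for $\lambda \mid y$ will then be Inverse
Gamma$\left(\frac{\nu+n}{2}, \frac{\nu + (y-\mu)'\Sigma^{-1}(y-\mu)}{2}\right)$ and we can immediately add this
distribution to a Gibbs Sampler.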
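The mixture representation can be illustrated with a small univariate Monte Carlo sketch. It assumes $n = 1$, draws the Inverse Gamma variable by inverting a Gamma draw, and checks two known properties of the $t_{10}$ distribution (variance $\nu/(\nu-2)$ and the two-sided 5% point 2.228). The function names are illustrative, and the full conditional follows the Inverse Gamma form stated above.

```python
import numpy as np

def sample_t_by_mixture(nu, size, rng, mu=0.0, sigma=1.0):
    """Draw y | lam ~ N(mu, lam * sigma^2) with lam ~ IG(nu/2, nu/2)."""
    lam = 1.0 / rng.gamma(shape=nu / 2.0, scale=2.0 / nu, size=size)
    return mu + sigma * np.sqrt(lam) * rng.standard_normal(size)

def draw_lambda_given_y(y, nu, size, rng, mu=0.0, sigma=1.0):
    """Full conditional of lam for a single observation (n = 1):
    Inverse Gamma((nu + 1)/2, (nu + s2)/2) with s2 = ((y - mu)/sigma)^2."""
    s2 = ((y - mu) / sigma) ** 2
    shape = (nu + 1.0) / 2.0
    rate = (nu + s2) / 2.0
    return 1.0 / rng.gamma(shape=shape, scale=1.0 / rate, size=size)

rng = np.random.default_rng(0)
nu = 10.0
draws = sample_t_by_mixture(nu, 200_000, rng)
sample_var = draws.var()
tail_frac = np.mean(np.abs(draws) > 2.228)

lam_draws = draw_lambda_given_y(2.0, nu, 200_000, rng)
print(sample_var, tail_frac, lam_draws.mean())
```

A large conditional draw of $\lambda$ for an observation signals that the observation is being downweighted, which is the outlier diagnostic mentioned in the text.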


4.2 Double Exponential/Laplace Density

In the univariate case, the Double Exponential can
be found by mixing a Normal with an Exponential with
mean 2 (Andrews & Mallows, 1974):

\[ \frac{1}{2\sigma}\, e^{-|y-\mu|/\sigma} = \int_0^\infty \frac{1}{\sigma\sqrt{2\pi\lambda}}\, e^{-(y-\mu)^2/(2\lambda\sigma^2)}\, \frac{1}{2}\, e^{-\lambda/2}\,d\lambda \]

Note, this density has variance $2\sigma^2$, so we often use
$\sigma^2/2$ as the variance of the Normal distribution. If we
extend this idea to the multivariate case, the resulting
multivariate density for $y$ unconditional on $\lambda$ is the symmetric
multivariate Bessel distribution. To see this, let
$y \mid \lambda \sim N_n(\mu, \lambda\Sigma)$ and $\lambda \sim \mathrm{Exp}(2)$. Then

density  resulting  from  this  mixture  as  a  Double  Expo¬ 
nential.  The  full  conditional  for  X\y  will  be  Generalized 
Inverse  Gaussian(l  —  1,  s^). 

4.3  Gibbs  Implementation 

By adding $\lambda$ to our hierarchy, the distributions conditional
on $\lambda$ will be Normal and we can utilize conjugacy
to find their exact form. The Inverse Gamma is generated
by inverting a Gamma random variable. To generate
from the Generalized Inverse Gaussian, GIG, we use
a rejection algorithm. If $\gamma < 0$ we take the reciprocal
of a GIG$(-\gamma, \alpha, \beta)$. If $\gamma > 1$ the density is log concave
and we use the "non-universal rejection algorithm" given
by Devroye (1986). Let $f(x)$ represent our GIG density
function, and set $h(x) = \log f(x)$. Since $f$ is log-concave,
$h$ can be majorized by its tangent line at any point,
which corresponds to fitting an exponential curve over $f$.
Thus, we use a piecewise majorizing function, $g(x)$, for
$f(x)$, where the first piece is an exponential curve, the
second piece is $f$ evaluated at the mode, and the third
piece is another exponential curve. We select points $a$
and $b$ to attach the exponentials so that the area under

g{x)  is  minimized.  Let  m  =  ^  be  the 

mode  of  /,  fj  the  tail  to  the  left  of  the  mode,  and  fr  the 
right  tail.  Theorem  2.6  of  Devroye  states  that  the  area 
will  be  minimal  if  we  choose  a  and  b  such  that 

m  +  a  —  ff,  (—  ) 

ni-b=fj 

We  use  a  binary  search  to  find  these  cross  points.  This 
tells  us  where  to  attach  the  exponential  curves  and  gives 
us  a  piecewise  majorizing  function. 


f{y)  = 

2  Jq 

=  2“7r"^(r””s“  2  i  f 

2  2Jq 

=  tt"  ^  "a  Ai-  a(s) 

where  K^{w)  is  the  modified 

Bessel  function  of  the  third  kind.  The  final  equation 
is  the  Multivariate  Bessel  density  (Fang,  Kotz,  k  Ng, 
1990)  with  parameters  a  =  1  —  f  and  6  =  ^.  The 
density  is  elliptically  symmetric.  If  n  =  1  the  density  is 
the  univariate  Double  Exponential.  If  n  =  2,  the  distri¬ 
bution  has  been  labeled  the  bivariate  Laplace  distribu¬ 
tion.  Thus,  this  mixture  provides  us  with  a  variant  of 
the  multivariate  Double  Exponential,  and  we  refer  to  the 


If $\gamma < 1$, we use the above algorithm for the three
regions to the left of the inflection point, and
majorize the region to the right of the inflection point
by a Pareto curve. In this case we need to calculate
the Bessel function in the constant of integration in
the GIG density. The inflection point is always to the right
of the mode. If the inflection point falls to the left of
$m + a$, then we attach the second exponential curve at
the inflection point. Figure 1 shows the density and the
piecewise majorizing function for $\alpha = 1$, $\gamma = .5$. If $\gamma = .5$
we actually generate from a Reciprocal Inverse Gaussian
by inverting an observation from an Inverse Gaussian.
To sample from an Inverse Gaussian we use the algorithm
of Michael, Schucany, and Haas given in Devroye
(1986) using a many-to-one transformation (p. 148-149).
In our hierarchical models, $\gamma = 1 - n/2$ so the only values
of $0 \le \gamma < 1$ we need to consider are $\gamma = 0, .5$ as
$n$ is an integer. The above algorithm extends that used
by Carlin and Polson (1991) which dealt with the case
$n = 1$.

Figure 1: Generating from a GIG density
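For the $\gamma = .5$ case, the Inverse Gaussian step can be sketched as follows. This is a minimal vectorized version of the many-to-one transformation method of Michael, Schucany, and Haas; the parameter names `mu` (mean) and `shape` are choices made here, and the mapping from the hierarchical model's quantities to these two parameters is not shown.

```python
import numpy as np

def sample_inverse_gaussian(mu, shape, size, rng):
    """Inverse Gaussian(mu, shape) draws via the many-to-one transformation."""
    v = rng.standard_normal(size) ** 2                     # chi-square(1) variate
    x = (mu + mu**2 * v / (2.0 * shape)
         - (mu / (2.0 * shape)) * np.sqrt(4.0 * mu * shape * v + mu**2 * v**2))
    u = rng.random(size)
    # keep the smaller root with probability mu / (mu + x), else take mu^2 / x
    return np.where(u <= mu / (mu + x), x, mu**2 / x)

def sample_reciprocal_inverse_gaussian(mu, shape, size, rng):
    """Reciprocal Inverse Gaussian draws, obtained by inverting Inverse Gaussian draws."""
    return 1.0 / sample_inverse_gaussian(mu, shape, size, rng)

rng = np.random.default_rng(1)
ig = sample_inverse_gaussian(1.0, 2.0, 200_000, rng)
rig = sample_reciprocal_inverse_gaussian(1.0, 2.0, 200_000, rng)
print(ig.mean(), ig.var(), rig.mean())
```

The Inverse Gaussian has mean `mu` and variance `mu**3 / shape`, and its reciprocal has mean `1/mu + 1/shape`, which gives a quick sanity check on the draws.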

5  CONCLUSION 

In summary, the negative log rate provides a characterization
of densities based on their tail behavior. This
characteristic determines when a Bayes estimate compromises
or rejects sources of information. This knowledge
aids in model selection, by knowing the consequences
of our assumptions in the presence of outlying
data. Given a particular density selection, the theory
also indicates when we could substitute alternative
distributions that have similar behavior but may be more
tractable. The Gibbs Sampling implementation allows
estimation for a prototype from each class.
1 Introduction

High ozone concentration in the troposphere is believed
to be harmful to human health and to crops (see National
Research Council (1991)). The surface ozone concentration
level is affected by the strengths of sources and precursor
emissions, and by meteorological conditions. To assess that
part of the trend in ozone concentration levels that cannot be
accounted for by meteorology, we need to build models which
relate ozone to meteorology.

In Bloomfield, Royle and Yang (1993), nonlinear least
squares methods were used to model the dependence of ozone
on meteorology, and to estimate the trends. That report
focuses on the urban Chicago area.

In this report, a semiparametric modeling technique is used
to build models that relate ozone to meteorology.

2  Semiparametric  Model 

The ozone concentration value to be modeled here is the
daily network typical value. To obtain the daily network typical
value, the least absolute deviations decomposition (or the
median polish decomposition, see Tukey (1977)) of $y_{d,s}$, the
maximum concentration on day $d$ at station $s$, was performed
for all the 45 ozone monitoring stations in the urban Chicago
area:

\[ y_{d,s} = \mu' + \alpha'_d + \beta'_s + \varepsilon'_{d,s} \]

The daily network typical value is then defined as $\mu' + \alpha'_d$.
The decomposition was also used to impute the missing data.
This daily network typical value is called the network average
in Bloomfield et al. (1993). The unit for ozone concentration
is parts per billion (ppb).
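A minimal sketch of a median polish for a day-by-station table follows. It is a generic Tukey-style implementation written for illustration, not the code used in the study; the function name, iteration cap, and the synthetic additive table are assumptions.

```python
import numpy as np

def median_polish(y, n_iter=20, tol=1e-9):
    """Decompose y[d, s] = overall + row[d] + col[s] + resid[d, s] by
    alternately sweeping out row and column medians.
    NaN entries are skipped when taking medians."""
    resid = np.array(y, dtype=float)
    overall = 0.0
    row = np.zeros(resid.shape[0])
    col = np.zeros(resid.shape[1])
    for _ in range(n_iter):
        rmed = np.nanmedian(resid, axis=1)     # sweep row (day) medians
        resid -= rmed[:, None]
        row += rmed
        delta = np.median(col)                 # recentre the column effects
        col -= delta
        overall += delta
        cmed = np.nanmedian(resid, axis=0)     # sweep column (station) medians
        resid -= cmed[None, :]
        col += cmed
        delta = np.median(row)                 # recentre the row effects
        row -= delta
        overall += delta
        if max(np.abs(rmed).max(), np.abs(cmed).max()) < tol:
            break
    return overall, row, col, resid

# Synthetic, exactly additive table: 5 days by 4 stations.
day_effect = np.array([-8.0, -3.0, 0.0, 4.0, 12.0])
station_effect = np.array([-5.0, -1.0, 2.0, 6.0])
table = 50.0 + day_effect[:, None] + station_effect[None, :]

overall, row, col, resid = median_polish(table)
typical = overall + row                        # daily network typical value
print(typical)
```

A missing cell can be imputed from the fitted pieces as `overall + row[d] + col[s]`, which is one way to read the remark that the decomposition was also used to fill in missing data.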

The same meteorological variables adopted by Bloomfield
et al. (1993) are used here. The surface weather data were
taken from O'Hare Airport and the upper air weather data
were taken from a station at Peoria in the same period ozone
data were taken. The variables used are:

• maximum temperature from 9:00 am to 6:00 pm (maxt)

• 12 noon wind speed (wspd)

• 24 hr ave. wind vector (meanu and meanv)

• 12 noon relative humidity (rh)

• 12 noon visibility (vis)

• 12 noon opaque cloud cover (opcov)

• 7 am wind speed at 700 mb (wspd700)

• 24 hr ave. temp. lagged 1 and 2 days (tlag1 and tlag2)

• 24 hr ave. wind speed lagged 1 day (wlag)

• 24 hr ave. relative humidity lagged 1 day (rhlag)

Also used is a variable for year, which takes the integer values
1, 2, ..., 11, corresponding to years 1981 - 1991, and a variable
for day taking values from 1 to 365 to reflect seasonal effects.

On day $i$, in year $j$, with meteorological condition met,
where met is a 12-dimensional vector of the above meteorological
variables, let $x = (\mathrm{met}, i, j)$. So $x$ is a 14-dimensional
vector $x = (x_1, \ldots, x_{14})$. The response $y(x)$ (the network
typical value) is assumed to be a realization of a stochastic
process, $Y(x)$:

\[ Y(\mathrm{met}, i, j) = \beta_j + Z(\mathrm{met}, i, j) + \varepsilon_{ij} \tag{1} \]

where $\beta_j$ are constants, $j = 1, 2, \ldots$, $Z(x) = Z(\mathrm{met}, i, j)$
is a zero mean Gaussian process with covariance
function $\mathrm{Cov}(Z(x), Z(x')) = \sigma_Z^2 R(x, x')$, and
$\varepsilon_{ij} \sim N(0, \sigma_\varepsilon^2)$. See Sacks, Welch, Mitchell and Wynn (1989) for
more discussion.

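Assume, as in Sacks et al. (1989), that the covariance between
$Z(x)$ and $Z(x')$ is

\[ \sigma_Z^2 R(x, x') = \sigma_Z^2 \exp\left( -\sum_{k=1}^{14} \theta_k\, |x_k - x'_k|^{p_k} \right) \tag{2} \]

where $x = (x_1, \ldots, x_{14})$, $x' = (x'_1, \ldots, x'_{14})$, $\theta_k \ge 0$,
$1 \le p_k \le 2$, $k = 1, \ldots, 14$. This class of stationary processes
provides us with a wide range of functions.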

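Given the data $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ for $g$ consecutive
years starting from year 1 (1981) with $n_j$ data points in
year $j$ and $n_1 + \cdots + n_g = n$ and, provided $\sigma_Z$, $\sigma_\varepsilon$ and $R(\cdot, \cdot)$
are known, the best linear unbiased predictor (BLUP) $\hat{y}(x)$
at a new point $x$ in year $j$ can be written as (see Sacks et al.
(1989))

\[ \hat{y}(x) = \hat{\beta}_j + \hat{Z}(x) = \hat{\beta}_j + r'(x)\, C^{-1} (y - F\hat{\beta}) \tag{3} \]

where $y = (y_1, y_2, \ldots, y_n)'$, $C = \mathrm{Corr}(y) = (\sigma_Z^2/\sigma^2) R + (\sigma_\varepsilon^2/\sigma^2) I$,
where $\sigma^2 = \sigma_Z^2 + \sigma_\varepsilon^2$, and $R = \{R(x_i, x_j),\ 1 \le i \le n,\ 1 \le j \le n\}$,
the $n \times n$ matrix of correlations among $Z$'s
at the data points, $r(x) = (\sigma_Z^2/\sigma^2)[R(x_1, x), \ldots, R(x_n, x)]'$,
and $\hat{\beta} = (\hat{\beta}_1, \ldots, \hat{\beta}_g)' = (F'C^{-1}F)^{-1} F'C^{-1} y$, which is the
usual generalized least-squares estimate of $\beta = (\beta_1, \ldots, \beta_g)'$.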

In the model, values of $p$ indicate smoothness of the response
surface as a function of the corresponding variables.
Larger values of $\theta$ usually indicate greater importance of the
corresponding variables if the variables are on normalized
scales.
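The predictor in (3) can be sketched in a few lines. The code below is an illustration under stated assumptions: $F$ is taken to be the $n \times g$ matrix of year indicators (the text does not show its definition), years are coded from 0, the covariates passed in `X` exclude the year so that the year enters only through $\beta_j$, and the tiny two-covariate data set is made up.

```python
import numpy as np

def corr(xa, xb, theta, p):
    """Correlation R(x, x') = exp(-sum_k theta_k |x_k - x'_k|^p_k), as in (2)."""
    d = np.abs(xa[:, None, :] - xb[None, :, :])
    return np.exp(-np.sum(theta * d**p, axis=2))

def blup(x_new, year_new, X, years, y, theta, p, var_z, var_eps):
    """BLUP at x_new in year year_new, following (3); returns (prediction, beta_hat)."""
    n = len(y)
    g = int(years.max()) + 1
    F = np.zeros((n, g))
    F[np.arange(n), years] = 1.0                       # assumed year-indicator design
    s2 = var_z + var_eps
    C = (var_z / s2) * corr(X, X, theta, p) + (var_eps / s2) * np.eye(n)
    Ci_F = np.linalg.solve(C, F)
    Ci_y = np.linalg.solve(C, y)
    beta = np.linalg.solve(F.T @ Ci_F, F.T @ Ci_y)     # generalized least squares
    r = (var_z / s2) * corr(X, x_new[None, :], theta, p)[:, 0]
    return beta[year_new] + r @ np.linalg.solve(C, y - F @ beta), beta

# Made-up data: six points, two rescaled covariates, two years.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [1.0, 1.0], [0.5, 0.2], [0.2, 0.8]])
years = np.array([0, 0, 0, 1, 1, 1])
y = np.array([40.0, 55.0, 48.0, 62.0, 51.0, 45.0])
theta = np.array([3.0, 3.0])
p = np.array([1.5, 1.5])

pred, beta_hat = blup(X[4], 1, X, years, y, theta, p, var_z=1.0, var_eps=0.0)
print(pred, beta_hat)
```

With `var_eps = 0` the predictor reproduces the observed value at a data point; a positive `var_eps` smooths the fit instead of interpolating.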


To obtain the unknown parameters $\sigma_Z$, $\sigma_\varepsilon$, $\theta$'s and $p$'s,
the maximum likelihood estimate (MLE) method is used. These
estimates are then used in (3) to predict the response surface.

This work focuses on the period from May 15 to Sept. 15,
the period when ozone concentration is high. This period is
divided into 4 smaller periods: May 15 - June 15, June 15 -
July 15, July 15 - Aug. 15 and Aug. 15 - Sept. 15. The reasons
are: First, the assumption of stationarity of $Z$ within a shorter
time period is more plausible. Secondly, fitting a model for
each of the 4 periods separately reduces the computational
burden. For more details, please see Gao, Sacks and Welch
(1994).

3  Modeling  the  Network  Typical  Value 

A  model  is  fitted  using  data  from  1981  to  1991  for  each  of 
the  4  periods. 

3.1  Important  Variables 

To see which meteorological variables have strong effects,
we rescale them so that each meteorological variable ranges
over [0,1]. The MLEs of the $\theta$'s and $p$'s with the rescaled
meteorological variables and the rescaled variables day and
year are given in Table 1.

The estimated $\theta$ for year was 0 for the first 3 monthly periods.
For the fourth monthly period, the estimated $\theta$ for year
was small, indicating that year was not an important variable.
Because the adjusted trend of ozone could be unambiguously
interpreted through the $\beta_j$'s if $\theta$ for year was 0 (see Section
3.3), we choose to set the $\theta$ for year equal to 0 in the fourth
period as well.

From the table, it can be seen that temperature, relative
humidity and wind (through wspd, wlag, meanu, meanv and/or
wspd700) are consistently important across the months. For
more discussion, please see Gao et al. (1994).

3.2  Quality  of  the  Fitted  Models 

To check the quality of the model fit, the cross validation
root mean square error (CVRMSE) was calculated. If the
model fit is good, the CVRMSE should be close to $\sigma_\varepsilon$ or its
MLE.
Table 2 lists the MLEs of $\sigma_Z$ and $\sigma_\varepsilon$ and the CVRMSEs
for the fitted models. The table shows that the model fits
are generally good. The values of CVRMSE are close to
the values of root mean square residual from the parametric
model fitting in Table 6 of Bloomfield et al. (1993). For more
discussion, please see Gao et al. (1994).

3.3  Trend  Estimation 

It is possible to interpret the adjusted trend through the
$\beta_j$'s in the model when the variable year does not appear in
the stochastic process part of the model $Z(\cdot)$, or equivalently
when $\theta$ for year is 0. Under these circumstances, if met is
held fixed, the change from year to year is, except for random
errors $\varepsilon$, reflected in the differences of the $\beta_j$'s. Therefore the
adjusted trend is defined as the trend in the $\beta_j$'s.

Let $\beta_j^* = \beta_j + (\bar{y} - \bar{\beta})$, then $\bar{\beta}^* = \bar{y}$. These $\beta_j^*$'s can
be interpreted as the adjusted (for meteorology) averages of
ozone level across the years while the simple yearly averages
$\bar{y}_j$'s are the unadjusted averages. The time series plots in
Figure 1 demonstrate that a large portion of the variability in
the unadjusted averages is eliminated in the adjusted averages.
This portion of the variability is caused by meteorology. The
plots suggest a linear trend for the adjusted averages. The
lines in the plots are the least square regression lines. Let $a$ be
the intercept at year = 81 and $b$ be the slope of the line, then
the estimate of the adjusted trend is

\[ \text{trend} = 10 \times \frac{b}{a} \quad (\%/\text{decade}). \tag{4} \]

Based  on  the  model  and  using  MLEs  of  the  parameters,  the 
standard  errors  of  the  estimates  of  the  trend  can  be  estimated. 
The  standard  errors  of  the  estimates  of  the  trend  can  also  be 
estimated  by  jackknifing  by  day  (see  Chapter  8  of  Mosteller 
and  Tukey  (1977)).  Also  see  Gao  et  al.  (1994)  for  more  details. 
The  estimates  of  the  trends  and  their  standard  errors  are  listed 
in  Table  3. 

3.4  Predictions 

The  models  constructed  can  be  used  to  predict  behavior 
of  ozone  in  future  years  as  a  function  of  meteorology.  The 
results  in  Gao  et  al.  (1994)  show  that  the  model  predictions 
closely  match  the  actual  ozone  levels. 


4  Conclusions 

The  semiparametric  modeling  technique  is  shown  to  pro¬ 
vide  a  good  way  to  model  the  ozone  concentration  as  a  func¬ 
tion  of  meteorology.  This  method  can  be  used  to  assess  the 
adjusted  trends.  The  models  can  also  be  used  to  predict  ozone 
levels  from  meteorology. 

It is found that for the urban Chicago area, there are significant downward trends for the network typical ozone values after adjusting for meteorology for the periods June 15 - July 15 and Aug. 15 - Sept. 15 over the 11 years studied (see Table 3).
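As a quick arithmetic check on Table 3, the tabulated t values appear to equal the absolute trend estimate divided by its standard error. Using the model estimates for the two periods with significant trends:

```latex
t_{\text{June 15 - July 15}} = \frac{|-0.0635|}{0.0285} \approx 2.23,
\qquad
t_{\text{Aug. 15 - Sept. 15}} = \frac{|-0.1094|}{0.0330} \approx 3.32 .
```

The same ratio for May 15 - June 15 is 0.0139 / 0.0280, or about 0.50, which is why no trend is claimed for that period.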

In  Bloomfield  et  al.  (1993),  for  the  period  of  Apr.  1  - 
Oct.  31,  the  adjusted  trend  for  the  network  typical  values  is 
found  to  be  -2.7%/decade  with  a  (jackknife)  standard  error 
of  3.4%/decade.  Results  from  the  two  reports  appear  to  be 
consistent. 
Table 1: Estimates of θ's and p's for models for the network typical value with meteorological variables rescaled.

variables  May 15 - June 15     June 15 - July 15    July 15 - Aug. 15    Aug. 15 - Sept. 15
           θ       p            θ       p            θ       p            θ       p
maxt       3.3222  2            4.8540  2            0.8202  2            1.7082  2
wspd       0.2368  2            0.0000  2            0.3261  2            0.0000  2
meanu      0.0000  2            0.7052  1.185        0.0733  1.435        0.3849  1
meanv      0.0000  2            1.4913  2            0.5097  2            1.6454  2
rh         1.7655  2            0.7756  2            0.5506  2            1.9768  2
vis        0.1833  2            1.1844  2            0.0333  2            0.6480  2
opcov      0.0000  2            0.0731  2            0.0000  2            0.0617  2
wspd700    0.5513  2            0.0000  2            0.0170  2            0.1922  2
tlag1      0.0000  2            0.5948  2            0.0000  2            0.0000  2
tlag2      0.0928  2            0.0000  2            0.0462  2            0.2011  2
wlag       0.0481  1            0.6122  2            0.0000  2            2.2078  2
rhlag      0.0000  2            0.3031  2            0.0000  2            0.1955  2
day        0.0939  2            0.1270  2            0.0000  2            0.1172  2
year       0.0000  2            0.0000  2            0.0000  2            0.0000  2

Table 2: Estimates of σ_z and σ_ε, and CVRMSE for models for the network typical value.

Models              σ_z      σ_ε     CVRMSE
May 15 - June 15    17.051   7.028   7.606
June 15 - July 15   17.584   7.028   8.433
July 15 - Aug. 15   33.194   9.236   9.828
Aug. 15 - Sept. 15  15.263   6.281   7.680

Table 3: Estimates of trends and their standard errors for the adjusted averages for the network typical value.

                     Model Estimates                       Jackknifed Estimates
Models               Trend     Standard Error  t Value     Trend     Standard Error  t Value
May 15 - June 15      0.0139   0.0280          0.4964       0.0149   0.0288          0.5174
June 15 - July 15    -0.0635   0.0285          2.2281      -0.0651   0.0287          2.2683
July 15 - Aug. 15    -0.0074   0.0313          0.2364      -0.0093   0.0310          0.3000
Aug. 15 - Sept. 15   -0.1094   0.0330          3.3152      -0.1146   0.0290          3.9517
SYMPOSIUM SESSION SCHEDULE


Thursday  June  16, 1994 

8:15 a.m. - 9:45 a.m.

Keynote  Session 

•  Gauss,  Statistics,  and  Gaussian  Elimination 

10:15 a.m. - 12:00 p.m.

Issues  in  Software 

•  Software  as  Property 

•  Developing  Interactive  Graphics  in  C++ 

•  Parallel  Computing  And  Statistics 

10:15  a.m.  - 12:00  p.m. 

Fast  Implementations  of  Smoothers 

•  Fast  Implementations  of  Nonparametric  Curve 

•  Estimation  and  Presentation  of  Regression  in 
Several  Variables  via  Warping  and  the  ASH 

•  Fast  and  Stable  Computation  of  Local 
Polynomials 

•  Fast  Implementations  of  Average  Derivative 
Estimation 

10:15  a.m.  - 12:00  p.m. 

Longitudinal and Mixed Models

•  Estimation  Methods  for  Nonlinear  Mixed- 
Effects  Models 

•  Experiences  with  Derivative-Free  REML  for 
Large,  Messy,  Multiple  Trait  Genetic  Models 
to  Estimate  Variances  and  Covariances 

•  Generalized  Estimating  Equations  and 
Extensions  for  Various  Clustered  Data 
Structures 

10:15  a.m.  - 12:00  p.m. 

Contributed  Papers  1:  Experimental  Design 

•  Dual  Space  Algorithms  for  Designing  Space 
Filling  Experiments 

•  Experiment  Design  for  Assessment  of 
Important  Inputs  to  a  Computer  Code 

•  Computations  in  a  Finite  Projective  Geometry 
for  Enumeration  of  Subdesigns 

•  Sampling  Plans  on  the  Sphere 


10:15  a.m.  - 12:00  p.m. 

Contributed  Papers  2:  Fractal,  Neural,  other 

•  Incorporating  Segmentation  Boundaries  into 
the  Calculation  of  Fractal  Dimension  Features 

•  Overfitting  in  Neural  Networks 

• Likelihood Profiles for Studying
Non-Identifiability

•  A  Method  for  Estimation  of  Parameters  of  the 
Keeney & Raiffa Utility Models Based on the
Normal  Logistic  Functions 

•  Statistical  Fitting  of  Financial  Models 

12:45  p.m.  - 1:30  p.m. 

POSTER  SESSIONS 

•  On  Calculating  the  Distribution  of  Independent 
Trials  with  Changing  Probabilities  of  Success 

•  Bayesian  Estimation  Using  the  Gibbs  Sampler 
for  the  Inhibition/Promotion  Cancer 
Chemoprevention  Experiment 

•  Computationally  Intensive  Statistical  Methods 
for  Quality  Control 

• Interval Analysis and Self-Validating
Computation of Non-Central F Probabilities
and Percentiles

•  An  Algorithm  for  Fitting  and  Displaying 
Distribution  Data 

•  MCMC  Methods  When  There  Is  Partial 
Exchangeability 

•  Graphically  Comparing  Two  Similarity 
Measures  Defined  over  Large  Databases 

•  Robust  Empirical  and  Hierarchical  Bayes 
Estimation  of  Normal  Means  and  Rates  in 
Longitudinal  Studies 

1:30  p.m.  -  3:15  p.m. 

Green Thumbs: Extensions and Applications of Tree Modeling Methods

•  The  Art  of  Growing  Classification  Trees 

•  Trees  for  Event  Rate  Data 

•  Hybrid  Trees 
1:30 p.m. - 3:15 p.m.

Bayesian  Curve  Fitting 

•  Gibbs  Sampling  schemes  for  Bayesian  Density 
Estimation  with  Mixtures 

•  Nonparametric  Additive  Regression  with 
Autocorrelated  Errors 

•  Issues  in  Bayesian  Analysis  of  Neural 
Networks 

•  Wavelets  and  Bayesian  Data  Analysis 

1:30  p.m.  -  3:15  p.m. 

Space Filling Experimental Designs: Theory, Computer Construction, and Analysis

•  Introduction  to  Space  Filling  Designs 

•  Algorithms  and  Uses  of  Space  Filling  Designs 

•  Analysis  of  Space  Filling  Designs 

1:30 p.m. - 3:15 p.m.

Contributed  Papers  3:  Longitudinal 

• A Monte Carlo EM Algorithm for Some
Grouped  and  Partially  Observed  Data  Models 
with  Random  Effects:  Ordinal  Probit,  Censored 
Regression  and  Tobit  Models 

•  A  Randomization  Test  for  Diverging  Trends  in 
Longitudinal  Data 

•  Linearizing  Transformations  in  Growth-curve 
Problems 

•  An  EM  Algorithm  Fitting  First-Order 
Conditional  Autoregressive  Models  to 
Longitudinal  Data 

•  REML  in  Generalized  Linear  Models:  A 
Conditional  Approach 

1:30  p.m.  -  3:15  p.m. 

Contributed  Papers  4:  Computing 

•  Random  Integration  Rules  for  Statistical 
Computation 

•  Using  PVM  on  Computation  for  Analysis  of 
Repeated  Measurement  Designs 

• Visualizing Large Time-Stamped Log Files

•  The  Multi-String  Rearranging  Memory  and  Its 
Use  in  Statistical  Computing 

•  Fast  Multidimensional  Density  Estimation 
based  on  Random-width  Bins 

3:45  p.m.  -  5:30  p.m. 

Panel: Statistics Education in the Computer Age


3:45  p.m.  -  5:30  p.m. 

Contributed Papers 5: Enhancements to Tree Algorithms

•  Multivariate  Split  Classification  Trees 

•  Global  Tree  Optimization:  a  non-greedy 
decision  tree  algorithm 

•  Growing  Decision  Trees  less  Greedily 

•  Tree  Structured  Density  Estimation 

•  Tree-Structured  Multivariate  Density 
Estimation  and  Its  Application  In 
Environmental  Modeling 

3:45  p.m.  -  5:30  p.m. 

Contributed  Papers  6:  Multiple  Comparisons 

• SIMMCOMP: an Splus Module for Simultaneous Inference

•  Classical  Multiple  Comparison  via  Naiman's 
Inequality  From  Hypercubes  to  Permutation 
Polytopes 

•  On  The  Analysis  of  Multiple  Correlated  Binary 
Endpoints  in  Medical  Studies 

3:45  p.m.  -  5:30  p.m. 

Contributed Papers 7: Smoothers and Nonparametric Regression

•  On  Partial  Cross  Validation  in  Nonparametric 
Regression

•  An  Iterative  Projection  Method  for 
Nonparametric  Additive  Regression  Modeling 

•  Nonparametric  Curve  Estimation  from  Indirect 
Observations 

•  Open  Questions  in  the  Application  of 
Smoothing  Methods  to  Finite  Population 
Inference 

•  Empirical  Examination  of  an  Efficient  Robust 
Linear  Regressor 

3:45  p.m.  -  5:30  p.m. 

Tutorial:  Introduction  to  Perl,  for  Statisticians 


Friday  June  17, 1994 

8:15  a.m.  -  9:45  a.m. 

Neural  Nets 

•  The  Accuracy  of  Bayes  Estimators  of  Neural 
Nets 

•  Using  neural  networks  to  estimate  functions 

•  Generalization  and  Exclusive  Allocation  in 
Unsupervised  Category  Learning 
8:15 a.m. - 9:45 a.m.

Contributed  Papers  8:  Density  Estimation 

•  Jump  and  Sharp  Detection  by  Wavelets 

•  Numerical  Techniques  in  Distribution  Fitting 

•  Maximum  Likelihood  Density  Estimation  with 
Term  Creation  and  Annihilation 

•  The  Bias-Optimized  Frequency  Polygon 

•  Data  Adaptive  Density  Estimation  of  DNA 
Distributions 

8:15  a.m.  -  9:45  a.m. 

Contributed  Papers  9:  Numerical 

•  Parseval  Quadrature  for  Normal  Tail 
Probabilities

•  A  Method  of  the  Computation  of  Multivariate 
Normal  Probabilities  Over  Any  Convex 
Region 

• Computer Random Variate Generation for
Multinomial  Distribution 

•  Systematic  Random  Leapfrog  Method  for 
Parallel  Random  Number  Generators 

•  Efficient  Programs  for  Simulating  Chi-bar 
Square  Distributions 

8:15  a.m.  -  9:45  a.m. 

Contributed  Papers  10:  Trees  II 

•  The  Cumulative  Score  Control  Chart  for  an 
Open  Loop  Control 

•  Piecewise  Proportional  Hazards  Survival  Trees 
With Time-Dependent Covariates

•  A  Tree-Based  Method  of  Analysis  for 
Prospective  Studies 

• Testing in High Dimensional Spaces via
Recursive  Partitioning 

•  Tree  Based  Classification  Using  a  Predictor 
with  Many  Categories 

8:15  a.m.  -  9:45  a.m. 

Tutorial: Networking Innovations and Resources, The Internet as Toolbox

10:15  a.m.  - 12:00  p.m. 

Nonparametric Regression for Edge and Peak Preserving

•  Cube  splitting  for  multidimensional  edges 

•  Discontinuity  estimation  in  nonparametric 
regression  via  orthogonal  series 

•  Semiparametric  Change-Point  Methods 

• Nonparametric autoregression-regression for
edge preserving: The estimate and its
application in computer vision


10:15 a.m. - 12:00 p.m.

Software  for  MetaAnalysis 

•  Epi-meta:  Meta-analytic  statistical  software  for 
epidemiological  studies 

•  Performing  meta-analyses  using  commercial 
mixed-model  software 

• Software for meta-analysis: a comparative
review 

10:15 a.m. - 12:00 p.m.

Special Contributed Papers 11: Visual Statistical Analysis

•  Visually  Guided  Statistical  Analysis 

•  Visual  Sensitivity  Analysis  for 
Multidimensional  Scaling 

•  Visual  Correspondence  Analysis 

•  Visual  Log-Linear  Analysis 

•  Visualizing  High-Dimensional  Space  with 
Principal  Components  Analysis 

10:15 a.m. - 12:00 p.m.

Contributed  Papers  12:  Gibbs  Samplers 

•  Monte  Carlo  Assessment  of  Influence  and 
Sensitivity  in  Bayesian  Modeling 

•  Bayesian  Inference  for  Nonlinear  Regression 
with  Covariate  Measurement  Error  via  Gibbs 
Sampling 

•  BUGS  (Bayesian  Inference  Using  Gibbs 
Sampling) 

•  Using  the  Gibbs  Sampler  to  Detect 
Changepoints: Application to Longitudinal
Markers  of  Disease 

• Applied Convergence Diagnostics for the
Gibbs  Sampler 

10:15 a.m. - 12:00 p.m.

Contributed  Papers  13:  Computing 

•  Statistical  Inference  for  Priority  Queues 

• On-Line Control of Stochastic Systems:
Application  to  the  Design  of  an  Artificial 
Pancreas 

•  Using  both  Symbolic  and  Classical  Methods  to 
Analyze  Complex  Data  Set  with  the  SAS 
System 

•  Statistical  Methods  in  Software  Engineering 

1:30  p.m.  -  3:15  p.m. 

Wavelets  Tutorial 
1:30 p.m. - 3:15 p.m.

Smart Monte Carlo Methods for Conditional Inference in Exponential Families

•  Approximate  conditional  inference  in 
exponential  families  via  the  Gibbs  sampler 

•  Saddlepoint  approximations  for  the  likelihood 
ratio  statistic  in  exponential  families 

•  Monte  Carlo  sampling  from  exponential 
families  under  linear  constraints 

1:30  p.m.  -  3:15  p.m. 

Stochastic  Modeling  In  Carcinogenesis 

•  Computational  issues  in  analyzing 
premalignant  liver  lesions 

•  Multi-pathway  multistage  models  of 
carcinogenesis 

• Time-dependent rates in interconnected
birth-death models

1:30  p.m.  -  3:15  p.m. 

Contributed  Papers  14:  Multivariate 

•  Stability  of  Homogeneity  Analysis 

•  Estimation  of  Covariance  Matrices  Using 
Eigenstructure  Influence 

•  Finding  the  Minimum  Volume  Ellipsoid 

•  Triangulation  and  Multivariate  Nonparametric 
Function  Estimation 

1:30  p.m.  -  3:15  p.m. 

Contributed  Papers  15:  Software 

•  Data  Conversion  Pitfalls 

•  Design  of  Object-Oriented  Functions  in  S  for 
Screen  Display,  Interface  and  Control,  of  Other 
Programs 

•  LISP  for  Interval  Computations 

•  Documentation  with  Online  Programs  Rather 
Than  Programs  with  Online  Documentation 

•  What  is  the  Most  Appropriate  Software  for  a 
Statistics  Course?,  John  D.  McKenzie,  Jr.,  and 
William  H.  Rybolt,  Babson  College 

3:45  p.m.  -  5:30  p.m. 

Panel of Editors of Journals for Statistical Computing


3:45  p.m.  -  5:30  p.m. 

Robust  Regression  and  Multivariate  Analysis 

•  Using  Multiple  Processors  to  Compute  Robust 
Regression  Estimators 

•  Identification  of  Outliers  in  Multivariate  Data 

•  Robust  Model  Comparison  for  Autoregressive 
Processes  with  Robust  Bayes  Factors 

3:45  p.m.  -  5:30  p.m. 

Applications  of  Wavelets 

•  S+WAVELETS:  An  Object-Oriented  Wavelet 
Toolkit 

•  The  Use  of  Wavelets  for  Spectral  Density 
Estimation  With  Local  Bandwidth  Adaptation 

•  An  Application  of  Wavelets  to  Tomography 

•  Use  of  Wavelets  for  Denoising  and  Feature 
Enhancement  in  Mammograms 

3:45  p.m.  -  5:30  p.m. 

Contributed  Papers  16:  Genetics 

•  A  Composite  Model  for  the  Distribution  of 
Species  and  Its  Use  in  Monitoring  Pattern 
Recognition  Algorithms 

•  Some  Computational  Problems  in  Modeling 
Molecular  Evolution 

•  Computation  of  Identity-by-Descent 
Proportion  for  Pedigree  Data 

•  Using  S  for  a  Bayesian  Analysis  of  Cleavage 
Sites  When  the  Amino  Sequence  in  a  Peptide 
Is  Known 

•  Inference  for  Lethal  Gene  Studies  via  Bayesian 
Markov  Chain  Simulation 

3:45 p.m. - 5:30 p.m.

Contributed  Papers  17:  Bootstrap 

•  Aggregation  Coefficients  of  Clustering  in 
Databases  and  Metric  Spaces 

•  On  a  Nearest  Neighbours  Oriented  Algorithm 
for  Missing  Data  Reconstitution-Application  to 
a Magmatic Data Array

•  Efficient  Computation  of  Statistical  Procedures 
Based  on  Subsetting  the  Observations 

•  A  Frequency  Domain  Bootstrap  for  Time 
Series 
Saturday June 18, 1994

8:15  a.m.  -  9:45  a.m. 

Tutorial: Markov Chain Monte Carlo in Bayesian and Likelihood Statistics

8:15 a.m. - 9:45 a.m.

Statistics of Protein and Macromolecular Structures

•  Threading  protein  sequences  through  folding 
motifs 

•  The  Inverse  Folding  Problem:  Analysis  by 
Statistical  and  Machine  Learning  Methods 

•  A  Gibbs  sampling  algorithm  for  the 
identification  and  characterization  of  a 
structural motif in a data base of biopolymer
sequences. 

8:15  a.m.  -  9:45  a.m. 

Contributed  Papers  18:  Graphics 

•  Visualizing  the  Destructive  Potential  of 
Indirect  Fire  Weapons 

•  A  Robust  Visual  Access  and  Analysis  System 
for  Very  Large  Multivariate  Databases 

•  Dynamic  Graphics  in  a  GIS:  A  Link  between 
Arc/Info  and  XGobi 

•  Variations  on  Row-labeled  Plots 

•  Data  Analysis  with  Graphical  Models 

8:15  a.m.  -  9:45  a.m. 

Contributed Papers 19: Nonparametrics

•  Relative  Power  of  Smirnov  and  Wilcoxon 
exact  tests  in  two-sample  . 

•  A  Simulation  Study  of  Some  Rank  Tests  for 
Interaction  in  Two-Way  Layouts 

• Computation of the Wilcoxon's T(n),
Wilcoxon's  W(m,n)  and  Ansari-Bradley's 
A(m,n)  Statistics  When  the  Sample  Size  is 
Small 

•  NonParametric  Estimation  of  Functions  from 
Stratified  Samples 

10:15  a.m.  - 12:00  p.m. 

Efficient  Bootstrap  Computations 

•  Concomitants  of  order  statistics  for  bootstrap 
distribution  estimation 

•  Fast  and  accurate  approximate  double 
bootstrap  confidence  intervals 

•  Saddlepoint  Control  Variates  and  Importance 
Sampling 


10:15 a.m. - 12:00 p.m.

Convergence  of  Markov  Chain  Samplers 

•  Efficient  Random-Walk  Metropolis 
Algorithms 

•  Theoretical  rates  of  convergence  for  Markov 
chain  Monte  Carlo 

•  The  fraction  of  missing  information  and 
convergence  rate  for  data  augmentation 

10:15  a.m.  - 12:00  p.m. 

Computational Techniques in Genetics and Molecular Biology

•  Monte  Carlo  Estimation  of  Autozygosity 
Probabilities,  Elizabeth  Thompson,  Univ  of 
Washington; 

•  Statistical  and  Computational  Challenges  in 
Physical  Mapping,  David  Nelson  &  Terence 
Speed,  Berkeley; 

•  Bayesian  Restoration  of  a  Hidden  Markov 
Chain  with  Application  to  Sequence 
Alignment, Gary  Churchill,  Cornell. 

10:15  a.m.  - 12:00  p.m. 

Contributed  Papers  20:  Robust 

•  Regression  Hazards  Model  with  Markov 
Process 

•  A  Principal  Components  Based  Algorithm  for 
Variable  Selection  in  Linear  Models 

•  Saddlepoint  Approximations  for  Robust  M 
Regression 

•  Rank  Cusum  Test  for  Change  in  the  Mean 

10:15  a.m.  - 12:00  p.m. 

Contributed  Papers  21:  Parametric  Modeling 

•  Perturbation  Bounds  for  Linear  Regression 
Problems 

•  Tailoring  Nonlinear  Least  Squares 
Algorithms  for  the  Analysis  of  Compartment 
Models 

•  Characterizing  Hierarchical  Model  Behavior 

•  Predicting  the  Urban  Ozone  Levels  and  Trends 
with  Semiparametric  Modeling 


